Getting ready

Let's start with a house. A house may have the following dimensions:

  • Area
  • Lot size
  • Number of rooms

With these three dimensions, we are working in three-dimensional space. Thus, the point (4500, 41000, 4) would be interpreted as an area of 4,500 sq. ft, a lot size of 41,000 sq. ft, and four rooms.

Points and vectors are the same thing. The dimensions of a vector are called features. Put another way, a feature is an individual measurable property of a phenomenon being observed.

Spark has local vectors and matrices, and also distributed matrices. A distributed matrix is backed by one or more RDDs. A local vector has integer, zero-based indices and double values, and is stored on a single machine.

There are two types of local vectors in MLlib: dense and sparse. A dense vector is backed by an array of its values, while a sparse vector is backed by two parallel arrays, one for indices and another for values.

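So, the house data (4500, 41000, 4) will be represented as [4500, 41000, 4] using the dense vector format and as (3, [0, 1, 2], [4500.0, 41000.0, 4.0]) using the sparse vector format, where 3 is the size of the vector, followed by the array of indices and the array of values.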
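
The two layouts can be illustrated with a minimal plain-Python sketch. It is not the Spark API: the helper names `to_sparse` and `to_dense` are made up for this illustration. In Spark's Scala API, the corresponding constructors are `Vectors.dense` and `Vectors.sparse` in `org.apache.spark.mllib.linalg`.

```python
# Plain-Python sketch of the dense and sparse layouts; not the Spark API.
# to_sparse and to_dense are hypothetical helper names.

house = [4500.0, 41000.0, 4.0]  # dense layout: area, lot size, rooms


def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list of values from the sparse triple."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


sparse_house = to_sparse(house)
print(sparse_house)  # (3, [0, 1, 2], [4500.0, 41000.0, 4.0])
```

Because the house vector has no zeros, the sparse form stores every index and value and saves nothing; it pays off only when most entries are zero.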

Whether to make a vector sparse or dense depends upon how many of its values are 0 or missing. Let's take the case of a vector with 10,000 values, 9,000 of them being 0. If we use the dense vector format, it would be a simple structure, but 90 percent of the space would be wasted on zeros. The sparse vector format would work out better here, as it would keep only the indices and values of the non-zero entries.
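
As a rough back-of-the-envelope check of the 10,000-value case, the sketch below compares the raw storage of the two formats. It assumes 8 bytes per double value and 4 bytes per integer index and ignores any object or array overhead, so the figures are only indicative; the function names are made up for this illustration.

```python
# Rough storage estimate for dense versus sparse layouts.
# Assumptions: 8-byte doubles, 4-byte integer indices, no object overhead.
DOUBLE_BYTES = 8
INT_BYTES = 4


def dense_bytes(size):
    """A dense vector stores one double per position, zeros included."""
    return size * DOUBLE_BYTES


def sparse_bytes(nonzeros):
    """A sparse vector stores one index and one double per non-zero entry."""
    return nonzeros * (INT_BYTES + DOUBLE_BYTES)


size, zeros = 10_000, 9_000
nonzeros = size - zeros

print(dense_bytes(size))       # 80000
print(sparse_bytes(nonzeros))  # 12000
```

Under these assumptions, the sparse layout is smaller whenever fewer than two-thirds of the entries are non-zero.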

Sparse data is very common, and Spark now natively supports the libsvm format for it, which stores one feature vector per line.
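
In the libsvm text format, each line has the form `label index1:value1 index2:value2 ...`, with one-based indices in ascending order. The following plain-Python sketch parses one such line into zero-based sparse arrays. The function name `parse_libsvm_line` and the label `1` in the sample line are made up for this illustration; in Spark itself, `MLUtils.loadLibSVMFile` does the loading.

```python
# Plain-Python sketch of reading one libsvm-formatted line; not the Spark API.
# parse_libsvm_line is a hypothetical helper name.


def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, indices, values)."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        index, value = item.split(":")
        indices.append(int(index) - 1)  # libsvm indices are one-based
        values.append(float(value))
    return label, indices, values


# The house vector with an illustrative label of 1.
sample = "1 1:4500 2:41000 3:4"
print(parse_libsvm_line(sample))  # (1.0, [0, 1, 2], [4500.0, 41000.0, 4.0])
```

Entries with a value of 0 are simply left out of the line, which is what makes the format a natural fit for sparse data.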
