The following table lists the algorithms supported by MLlib, classified by the type of machine learning and the type of algorithm:
Let's understand these algorithms now.
Supervised learning deals with labeled training data. For example, historical e-mail training data will have e-mails marked as ham or spam. This data is used to train a model that can predict and classify future e-mails as ham or spam. Supervised learning problems can be broadly categorized into two major types—classification and regression:
Classification predicts categorical variables, or classes. A couple of examples are spam detection and predicting customer churn. The target variable is discrete and has a predefined set of values. The classification algorithms are as follows:
Regression deals with a target variable that is continuous. For example, when predicting house prices, the target variable, price, is continuous and doesn't have a predefined set of values. The regression algorithms are as follows:
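For instance, a spam detector is a binary classifier trained on labeled feature vectors. The following is only a minimal sketch using LogisticRegressionWithLBFGS from pyspark.mllib; it assumes the pyspark shell's SparkContext is available as sc, and the tiny two-feature dataset is made up for illustration:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# Label 1.0 = spam, 0.0 = ham; the two features are hypothetical word counts
data = sc.parallelize([
    LabeledPoint(0.0, [1.0, 0.0]),
    LabeledPoint(1.0, [0.0, 2.0]),
    LabeledPoint(0.0, [2.0, 0.0]),
    LabeledPoint(1.0, [0.0, 3.0])])

model = LogisticRegressionWithLBFGS.train(data)
print(model.predict([0.0, 2.5]))   # predicts the class (0 or 1) of a new e-mail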
Unsupervised learning deals with unlabeled data. The objective is to observe structure in the data and find patterns. Tasks such as cluster analysis, association rule mining, outlier detection, dimensionality reduction, and so on can be modeled as unsupervised learning problems.
The clustering algorithms are as follows:
Dimensionality reduction algorithms aim to reduce the number of features under consideration, which reduces noise in the data and focuses on the key features. These algorithms include the following:
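To illustrate clustering, the following minimal sketch runs KMeans from pyspark.mllib on a handful of made-up two-dimensional points; it assumes the pyspark shell's SparkContext is available as sc:
from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[1.0, 1.0], [1.5, 2.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # the two learned centroids
print(model.predict([1.2, 1.3]))   # cluster assignment for a new point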
Recommender systems are used to recommend products or information to users. Examples are video recommendations on YouTube or Netflix.
Collaborative filtering forms the basis for recommender systems. It builds a user-item association matrix and aims to fill in the missing entries: based on other users and items, along with their ratings, it recommends an item for which the target user has no rating. In Spark, one of the most useful algorithms is Alternating Least Squares (ALS), which is described as follows:
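As a rough illustration of ALS in pyspark.mllib, the following sketch trains a model on a few made-up (user, product, rating) triples and asks for a prediction and recommendations; it assumes the pyspark shell's SparkContext is available as sc:
from pyspark.mllib.recommendation import ALS, Rating

ratings = sc.parallelize([
    Rating(1, 10, 5.0),
    Rating(1, 20, 1.0),
    Rating(2, 10, 4.0),
    Rating(2, 30, 5.0)])

model = ALS.train(ratings, rank=10, iterations=5)
print(model.predict(1, 30))            # predicted rating for an item user 1 has not rated
print(model.recommendProducts(1, 2))   # top-2 product recommendations for user 1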
Feature extraction and transformation are essential techniques for processing large text documents and other datasets. MLlib provides the following techniques:
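For example, TF-IDF weighting of text can be computed with HashingTF and IDF from pyspark.mllib.feature. The following minimal sketch assumes the pyspark shell's SparkContext is available as sc and uses two made-up documents:
from pyspark.mllib.feature import HashingTF, IDF

docs = sc.parallelize([
    "spark makes big data simple".split(" "),
    "mllib makes machine learning simple".split(" ")])

tf = HashingTF().transform(docs)   # term frequencies as sparse vectors
tf.cache()
idf = IDF().fit(tf)
tfidf = idf.transform(tf)          # re-weight terms by inverse document frequency
print(tfidf.first())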
MLlib's optimization algorithms focus on gradient descent techniques. Spark provides an implementation of gradient descent that runs on a distributed cluster of machines. This is compute-intensive because it iterates through all of the available data. The optimization algorithms include the following:
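As a rough illustration, LinearRegressionWithSGD exposes this gradient descent loop through its iterations and step parameters. The sketch below assumes the pyspark shell's SparkContext is available as sc and uses a tiny made-up dataset that approximates y = 2x:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

data = sc.parallelize([
    LabeledPoint(2.0, [1.0]),
    LabeledPoint(4.0, [2.0]),
    LabeledPoint(6.0, [3.0])])

# iterations and step control the distributed gradient descent loop
model = LinearRegressionWithSGD.train(data, iterations=100, step=0.1)
print(model.weights)          # should approach [2.0]
print(model.predict([4.0]))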
MLlib algorithms use four data types, namely local vector, labeled point, local matrix, and distributed matrix:
We need to install numpy to work with MLlib data types and algorithms, so install it with the following commands:
[cloudera@quickstart ~]$ sudo yum -y install python-pip
[cloudera@quickstart ~]$ sudo pip install --upgrade pip
[cloudera@quickstart ~]$ sudo pip install 'numpy==1.8.1'
[cloudera@quickstart ~]$ sudo pip install 'scipy==0.9.0'
A dense vector is a traditional array of doubles:
>>> import numpy as np
>>> import scipy.sparse as sps
>>> from pyspark.mllib.linalg import Vectors
>>> dv1 = np.array([2.0, 0.0, 5.0])
>>> dv1
array([ 2.,  0.,  5.])
A sparse vector stores only the non-zero entries, using integer indices and double values:
>>> sv1 = Vectors.sparse(4, [0, 3], [5.0, 1.0])
>>> sv1
SparseVector(4, {0: 5.0, 3: 1.0})
A labeled point is a local vector, dense or sparse, associated with a label; it is used in supervised learning algorithms:
>>> from pyspark.mllib.linalg import SparseVector
>>> from pyspark.mllib.regression import LabeledPoint

# Labeled point with a positive label and a dense feature vector
>>> lp_pos = LabeledPoint(1.0, [4.0, 0.0, 2.0])
>>> lp_pos
LabeledPoint(1.0, [4.0,0.0,2.0])

# Labeled point with a negative label and a sparse feature vector
>>> lp_neg = LabeledPoint(0.0, SparseVector(5, [1, 2], [3.0, 5.0]))
>>> lp_neg
LabeledPoint(0.0, (5,[1,2],[3.0,5.0]))
A local matrix has integer row and column indices and double values, and is stored on a single machine. Note that the values are given in column-major order:
from pyspark.mllib.linalg import Matrix, Matrices

# Dense matrix ((1.0, 2.0, 3.0), (4.0, 5.0, 6.0)), values in column-major order
dMatrix = Matrices.dense(2, 3, [1, 4, 2, 5, 3, 6])

# Sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0))
sMatrix = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
The distributed matrix comes in four types: RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix:
RowMatrix: This is a distributed matrix of rows with meaningless indices, created from an RDD of vectors.
IndexedRowMatrix: Row indices are meaningful in this matrix. The RDD is created with indexed rows using the IndexedRow class, and then the IndexedRowMatrix is created from it.
CoordinateMatrix: This matrix is useful for very large and sparse matrices. A CoordinateMatrix is created from an RDD of MatrixEntry points, each represented by a tuple of the long or float type.
BlockMatrix: A BlockMatrix is created from an RDD of sub-matrix blocks.