Machine learning algorithms

The following table provides a list of algorithms supported by MLlib with classifications such as the type of machine learning and the type of algorithm:

Type of machine learning | Type of algorithm                     | Algorithm name
-------------------------|---------------------------------------|---------------
Supervised learning      | Classification                        | Naive Bayes, Decision Trees, Random Forests, Gradient-Boosted Trees
Supervised learning      | Regression                            | Linear Regression, Logistic Regression, Support Vector Machines
Unsupervised learning    | Clustering                            | K-Means, Gaussian mixture, Power Iteration Clustering (PIC), Latent Dirichlet Allocation (LDA), Streaming k-means
Unsupervised learning    | Dimensionality reduction              | Singular Value Decomposition (SVD), Principal Component Analysis (PCA)
Recommender systems      | Collaborative filtering               | User-based collaborative filtering, Item-based collaborative filtering, Alternating Least Squares (ALS)
Feature extraction       | Feature extraction and transformation | TF-IDF, Word2Vec, Standard Scaler, Normalizer, Chi-Square Selector
Optimization             | Optimization                          | Stochastic Gradient Descent, Limited-memory BFGS (L-BFGS)

Let's understand these algorithms now.

Supervised learning

Supervised learning deals with labeled training data. For example, historical e-mail training data will have e-mails marked as ham or spam. This data is used to train a model that can predict and classify future e-mails as ham or spam. Supervised learning problems can be broadly categorized into two major types—classification and regression:

Classification predicts categorical variables or classes. A couple of examples are spam detection and predicting customer churn. The target variable is discrete and has a predefined set of values. The classification algorithms are as follows, with a short training sketch after the list:

  • Naive Bayes: This algorithm makes predictions based on the conditional probability distribution of a label given an observation. It assumes that the features are mutually independent.
  • Decision Trees: This algorithm uses a decision tree as a predictive model, which maps observations about an item to conclusions about the item's target value.
  • Ensembles of trees (Random Forests and Gradient-Boosted Trees): Ensemble algorithms combine base decision tree models to build a robust model. They are intuitive and very successful for classification and regression tasks.
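
Here is a minimal sketch of training a Naive Bayes classifier with MLlib's RDD-based API. It assumes the pyspark shell, where sc (the SparkContext) is predefined, and uses a tiny, hypothetical spam/ham dataset:

    # Minimal Naive Bayes sketch; sc is the SparkContext provided by the
    # pyspark shell, and the spam/ham training data is hypothetical
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.regression import LabeledPoint

    # Label 1.0 = spam, 0.0 = ham; two hypothetical word-count features
    training = sc.parallelize([
        LabeledPoint(0.0, [1.0, 0.0]),
        LabeledPoint(0.0, [2.0, 0.0]),
        LabeledPoint(1.0, [0.0, 1.0]),
        LabeledPoint(1.0, [0.0, 2.0])])
    model = NaiveBayes.train(training, 1.0)  # 1.0 is the smoothing parameter
    print(model.predict([0.0, 1.0]))         # prints 1.0 (spam)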

Regression deals with a continuous target variable. For example, when predicting house prices, the target variable, price, is continuous and doesn't have a predefined set of values. The regression algorithms are as follows, again with a short sketch after the list:

  • Regression Models (Linear Regression, Logistic Regression, and Support Vector Machines): Regression algorithms are expressed as convex optimization problems that aim to minimize an objective function based on a vector of weight variables. The objective function controls the complexity of the model through its regularization term and the error of the model through its loss term.
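
A similarly minimal linear regression sketch, under the same pyspark shell assumption, with hypothetical single-feature data where the target is roughly twice the feature:

    # Minimal linear regression sketch; the data is hypothetical
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    points = sc.parallelize([
        LabeledPoint(2.0, [1.0]),
        LabeledPoint(4.0, [2.0]),
        LabeledPoint(6.0, [3.0])])
    model = LinearRegressionWithSGD.train(points, iterations=100, step=0.1)
    print(model.predict([4.0]))   # should be close to 8.0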

Unsupervised learning

Unsupervised learning deals with unlabeled data. The objective is to discover structure in the data and find patterns. Tasks such as cluster analysis, association rule mining, outlier detection, dimensionality reduction, and so on can be modeled as unsupervised learning problems.

The clustering algorithms are as follows, with a k-means sketch after the list:

  • K-Means: This algorithm partitions n observations into k clusters of similar objects, for example, grouping similar customers together to target them separately, detecting anomalous data, and clustering text documents.
  • Gaussian mixture: This is a probabilistic model that, like k-means, is used for data clustering.
  • Power Iteration Clustering (PIC): This algorithm groups vertices of a graph based on pairwise edge similarities.
  • Latent Dirichlet Allocation (LDA): This algorithm is used to group collections of text documents into topics.
  • Streaming k-means: This algorithm clusters streaming data dynamically using a windowing function on the incoming data. This is a really useful algorithm in Spark Streaming applications.
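
Here is a minimal k-means sketch, again assuming the pyspark shell, with hypothetical 2-D points that form two obvious groups:

    # Minimal k-means sketch; the 2-D points are hypothetical
    from pyspark.mllib.clustering import KMeans

    points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
    model = KMeans.train(points, 2, maxIterations=10)
    print(model.clusterCenters)        # two centers, one per group
    print(model.predict([0.5, 0.5]))   # cluster index for a new point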

Dimensionality reduction algorithms aim to reduce the number of features under consideration. This reduces noise in the data and focuses attention on the key features. These algorithms include the following, with a sketch after the list:

  • Singular Value Decomposition (SVD): This algorithm breaks the matrix that contains the data into simpler, meaningful pieces. It factorizes the initial matrix into three matrices.
  • Principal Component Analysis (PCA): This algorithm approximates a high-dimensional dataset with a low-dimensional subspace.
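
The following sketch computes both on a small distributed RowMatrix. One caveat: computeSVD and computePrincipalComponents are exposed in the Python API only from Spark 2.2 onwards (earlier releases provide them in Scala and Java), and the input rows here are hypothetical:

    # Minimal SVD/PCA sketch; requires Spark 2.2+ for these Python methods
    from pyspark.mllib.linalg.distributed import RowMatrix

    rows = sc.parallelize([[1.0, 2.0, 3.0],
                           [2.0, 4.1, 6.0],
                           [3.0, 6.0, 9.2]])
    mat = RowMatrix(rows)
    svd = mat.computeSVD(2, computeU=True)    # top 2 singular values/vectors
    print(svd.s)                              # singular values
    pcs = mat.computePrincipalComponents(2)   # top 2 principal components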

Recommender systems

Recommender systems are used to recommend products or information to users. Examples are video recommendations on YouTube or Netflix.

Collaborative filtering forms the basis for recommender systems. It creates a user-item association matrix and aims to fill in the gaps. Based on other users and items, along with their ratings, it recommends an item that the target user has no ratings for. In Spark, one of the most useful algorithms is Alternating Least Squares, which is described as follows and sketched in code after the description:

  • Alternating Least Squares (ALS): ALS models the rating matrix (R) as the product of low-rank user (U) and product (V) factor matrices and learns these factors by minimizing the reconstruction error on the observed ratings. Input data can be of two types—explicit feedback or implicit feedback from users. In explicit feedback, the relationship between a set of user-item pairs is directly known, for example, users' ratings of movies. In implicit feedback, the relationship is not directly observed; a recommender system has to infer user preferences from signals such as movies watched, purchases, and click or search events. An implicit feedback problem is much harder than an explicit feedback problem. This is discussed in detail in Chapter 8, Building Recommendation Systems with Spark and Mahout.
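
A minimal ALS sketch, assuming the pyspark shell, with a handful of hypothetical explicit ratings given as (user, product, rating) triples:

    # Minimal ALS sketch; the ratings are hypothetical
    from pyspark.mllib.recommendation import ALS, Rating

    ratings = sc.parallelize([Rating(1, 10, 5.0), Rating(1, 20, 1.0),
                              Rating(2, 10, 4.0), Rating(2, 30, 3.0)])
    model = ALS.train(ratings, rank=5, iterations=10)
    print(model.predict(1, 30))   # predicted rating for an unseen pair
    # For implicit feedback, use ALS.trainImplicit instead of ALS.train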

Feature extraction and transformation

Feature extraction and transformation are essential techniques for processing large text documents and other datasets. They include the following techniques, with a TF-IDF sketch after the list:

  • TF-IDF: Search engines use TF-IDF to score and rank document relevance in a vast corpus. It is also used in machine learning to determine the importance of a word in a document or corpus. Term frequency (TF) determines the weight of a term based on how frequently it occurs in a document. Inverse document frequency (IDF) measures how much information a term provides, that is, whether the term is rare or common across all documents in the corpus.
  • Word2Vec: This method computes the distributed vector representation of words. It includes two models—skip-gram and continuous bag of words. The skip-gram model predicts neighboring words in sliding windows of words. The continuous bag of words model predicts the current word given the neighboring words.
  • Standard Scaler: Datasets must often be standardized by mean removal and variance scaling as part of preprocessing. Standard Scaler computes the mean and standard deviation on the training data and applies the same transformation to the test data.
  • Normalizer: This scales the individual samples to have a unit norm.
  • Chi-Square Selector: This applies the chi-squared test of independence to select the features that are most relevant to the predicted label.
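
The following is a minimal TF-IDF sketch, assuming the pyspark shell and a hypothetical two-document corpus that has already been tokenized:

    # Minimal TF-IDF sketch; the corpus is hypothetical
    from pyspark.mllib.feature import HashingTF, IDF

    docs = sc.parallelize([["spark", "mllib", "tfidf"],
                           ["spark", "streaming"]])
    tf = HashingTF().transform(docs)   # hashed term-frequency vectors
    tf.cache()                         # IDF makes two passes over the data
    idf = IDF().fit(tf)
    tfidf = idf.transform(tf)          # TF-IDF weighted vectors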

Optimization

MLlib's optimization algorithms focus on gradient descent techniques. Spark provides an implementation of gradient descent on a distributed cluster of machines. This is compute-intensive because it iterates through all the available data. The optimization algorithms include the following, with a sketch after the list:

  • Stochastic Gradient Descent: This minimizes an objective function that is a sum of differentiable functions. Stochastic Gradient Descent uses only a sample of the training data to update the parameters in a given iteration. It is used for large-scale and sparse machine learning problems, such as text classification.
  • Limited-memory BFGS (L-BFGS): As the name suggests, L-BFGS uses a limited amount of memory, which suits the distributed implementation of optimization algorithms in Spark MLlib.
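
To contrast the two optimizers, the following sketch trains logistic regression with each backend; MLlib exposes both LogisticRegressionWithSGD and LogisticRegressionWithLBFGS, and the two-point dataset here is hypothetical:

    # Minimal sketch comparing the SGD and L-BFGS backends
    from pyspark.mllib.classification import (LogisticRegressionWithSGD,
                                              LogisticRegressionWithLBFGS)
    from pyspark.mllib.regression import LabeledPoint

    data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                           LabeledPoint(1.0, [1.0, 0.0])])
    sgd_model = LogisticRegressionWithSGD.train(data, iterations=100)
    lbfgs_model = LogisticRegressionWithLBFGS.train(data, iterations=100)
    print(sgd_model.predict([1.0, 0.0]))     # expected: 1
    print(lbfgs_model.predict([1.0, 0.0]))   # expected: 1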

Spark MLlib data types

MLlib supports four data types used in its algorithms—local vector, labeled point, local matrix, and distributed matrix.

We need to install NumPy (and SciPy) to work with MLlib data types and algorithms, so use the following commands to install them:

Note

All programs in this chapter are executed on CDH 5.8 VM. For other environments, file paths might change, but the concepts are the same in any environment.

[cloudera@quickstart ~]$ sudo yum -y install python-pip
[cloudera@quickstart ~]$ sudo pip install --upgrade pip
[cloudera@quickstart ~]$ sudo pip install 'numpy==1.8.1'
[cloudera@quickstart ~]$ sudo pip install 'scipy==0.9.0'
  • Local vector: This can be a dense or sparse vector that is stored on a single machine.

    A dense vector is a traditional array of doubles:

    >>> import numpy as np
    >>> from pyspark.mllib.linalg import Vectors
    >>> # A NumPy array or Vectors.dense both serve as dense vectors
    >>> dv1 = np.array([2.0, 0.0, 5.0])
    >>> dv1
    array([ 2.,  0.,  5.])
    >>> dv2 = Vectors.dense([2.0, 0.0, 5.0])
    >>> dv2
    DenseVector([2.0, 0.0, 5.0])
    

    A sparse vector stores only the non-zero entries, using integer indices and double values; the first argument is the vector's size:

    >>> sv1 = Vectors.sparse(4, [0, 3], [5.0, 1.0])
    >>> sv1
    SparseVector(4, {0: 5.0, 3: 1.0})
    
  • Labeled point: This can be a dense or sparse vector with a label used in supervised learning:
    >>> from pyspark.mllib.linalg import SparseVector
    >>> from pyspark.mllib.regression import LabeledPoint
    
    # Labeled point with a positive label and a dense feature vector
    >>> lp_pos = LabeledPoint(1.0, [4.0, 0.0, 2.0])
    >>> lp_pos
    LabeledPoint(1.0, [4.0,0.0,2.0])
    
    # Labeled point with a negative label and a sparse feature vector
    >>> lp_neg = LabeledPoint(0.0, SparseVector(5, [1, 2], [3.0, 5.0]))
    >>> lp_neg
    LabeledPoint(0.0, (5,[1,2],[3.0,5.0]))
    
  • Local matrix: This is a matrix with integer-typed row and column indices and double-typed values, also stored on a single machine:
    from pyspark.mllib.linalg import Matrices
    # Dense matrix ((1.0, 2.0, 3.0), (4.0, 5.0, 6.0));
    # values are passed in column-major order
    dMatrix = Matrices.dense(2, 3, [1, 4, 2, 5, 3, 6])
    # Sparse matrix ((9.0, 0.0), (0.0, 8.0), (0.0, 6.0)) in CSC format:
    # column pointers, row indices, and non-zero values
    sMatrix = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
    
  • Distributed matrix: This matrix is stored distributively in one or more RDDs on multiple machines. There are four types of distributed matrix available—RowMatrix, IndexedRowMatrix, CoordinateMatrix, and BlockMatrix (a combined sketch follows these descriptions):
    • RowMatrix: This is a distributed matrix of rows without meaningful row indices, created from an RDD of vectors.
    • IndexedRowMatrix: Row indices are meaningful in this matrix. The RDD is created with indexed rows using the IndexedRow class, and then the IndexedRowMatrix is created from it.
    • CoordinateMatrix: This matrix is useful for very large and sparse matrices. It is created from an RDD of MatrixEntry instances, where each entry is a (long, long, float) tuple holding the row index, column index, and value.
    • BlockMatrix: BlockMatrix is created from RDDs of sub-matrix blocks.
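
The following combined sketch creates each distributed matrix type; it assumes the pyspark shell on Spark 1.6 or later, where the pyspark.mllib.linalg.distributed module is available:

    # Minimal sketch of the four distributed matrix types
    from pyspark.mllib.linalg.distributed import (RowMatrix, IndexedRow,
        IndexedRowMatrix, CoordinateMatrix, MatrixEntry)

    rm = RowMatrix(sc.parallelize([[1.0, 2.0], [3.0, 4.0]]))
    irm = IndexedRowMatrix(sc.parallelize([IndexedRow(0, [1.0, 2.0]),
                                           IndexedRow(1, [3.0, 4.0])]))
    cm = CoordinateMatrix(sc.parallelize([MatrixEntry(0, 0, 1.0),
                                          MatrixEntry(1, 1, 4.0)]))
    bm = cm.toBlockMatrix()   # converts to a BlockMatrix of sub-matrix blocks
    print(rm.numRows())       # 2
    print(rm.numCols())       # 2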