Classifying Spark MLlib algorithms

Spark MLlib is a rapidly evolving module of Spark, with new algorithms added in each release.

The following diagram provides a high-level overview of the Spark MLlib algorithms, grouped by the traditional broad machine learning techniques and by the categorical or continuous nature of the data:

[Figure: Classifying Spark MLlib algorithms]

We categorize the Spark MLlib algorithms in two columns, categorical or continuous, depending on the type of data. We distinguish between data that is categorical, or more qualitative in nature, and continuous data, which is quantitative in nature. An example of a qualitative problem is predicting the weather: given the atmospheric pressure, the temperature, and the presence and type of clouds, the weather will be sunny, dry, rainy, or overcast; these are discrete values. On the other hand, say we want to predict house prices given the location, the square meterage, and the number of bedrooms; the real estate value can be predicted using linear regression. In this case, we are talking about continuous or quantitative values.

The horizontal grouping reflects the type of machine learning method used. The distinction between unsupervised and supervised machine learning techniques depends on whether the training data is labeled. In an unsupervised learning problem, no labels are given to the learning algorithm, and the goal is to find hidden structure in the input. In the case of supervised learning, the data is labeled, and the focus is on making predictions using regression if the data is continuous, or classification if the data is categorical.

An important category of machine learning is recommender systems, which leverage collaborative filtering techniques. The Amazon web store and Netflix have very powerful recommender systems built on these techniques.

Stochastic Gradient Descent is a machine learning optimization technique that is well suited to Spark's distributed computation model.
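To make this concrete, the following is a minimal sketch of training a linear regression model with SGD through the RDD-based MLlib API. The app name and the toy data points are invented for illustration, and miniBatchFraction is set below 1.0 so that each iteration updates the weights from a random sample of the data:

    from pyspark import SparkContext
    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    sc = SparkContext(appName="SGDSketch")  # hypothetical app name

    # Toy labeled points: a label followed by a single feature (illustrative values)
    data = sc.parallelize([
        LabeledPoint(1.0, [1.0]),
        LabeledPoint(2.0, [2.0]),
        LabeledPoint(3.0, [3.0]),
    ])

    # Each iteration computes the gradient on a random sample of the data,
    # which is what makes SGD amenable to Spark's distributed execution
    model = LinearRegressionWithSGD.train(data, iterations=100, step=0.1,
                                          miniBatchFraction=0.5)
    print(model.weights)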

For processing large amounts of text, Spark offers essential libraries for feature extraction and transformation, such as TF-IDF (short for Term Frequency-Inverse Document Frequency), Word2Vec, the standard scaler, and the normalizer.
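As a brief sketch of these utilities, the following computes TF-IDF vectors with the RDD-based API; the two toy documents and the app name are invented for illustration:

    from pyspark import SparkContext
    from pyspark.mllib.feature import HashingTF, IDF

    sc = SparkContext(appName="TFIDFSketch")  # hypothetical app name

    # Each document is a list of tokens; a real pipeline would tokenize properly
    documents = sc.parallelize([
        "spark mllib offers feature extraction".split(),
        "tf idf weighs terms by their rarity".split(),
    ])

    # Term frequencies via feature hashing, then IDF reweighting
    tf = HashingTF().transform(documents)
    tf.cache()  # reused by both fit and transform
    tfidf = IDF().fit(tf).transform(tf)
    print(tfidf.first())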

Supervised and unsupervised learning

Here, we delve more deeply into the traditional machine learning algorithms offered by Spark MLlib. We distinguish between supervised and unsupervised learning depending on whether the data is labeled, and between categorical and continuous depending on whether the data is discrete or continuous.

The following diagram shows the Spark MLlib supervised and unsupervised machine learning algorithms and preprocessing techniques:

[Figure: Supervised and unsupervised learning]

The following supervised and unsupervised MLlib algorithms and preprocessing techniques are currently available in Spark:

  • Clustering: This is an unsupervised machine learning technique where the data is not labeled. The aim is to extract structure from the data:
    • K-Means: This partitions the data into K distinct clusters (see the sketch after this list)
    • Gaussian Mixture: Clusters are assigned based on the maximum posterior probability of the component
    • Power Iteration Clustering (PIC): This groups vertices of a graph based on pairwise edge similarities
    • Latent Dirichlet Allocation (LDA): This is used to group collections of text documents into topics
    • Streaming K-Means: This dynamically clusters streaming data, using a windowing function on the incoming data
  • Dimensionality Reduction: This aims to reduce the number of features under consideration. Essentially, this reduces noise in the data and focuses on the key features:
    • Singular Value Decomposition (SVD): This breaks the matrix that contains the data into simpler, meaningful pieces. It factorizes the initial matrix into three matrices.
    • Principal Component Analysis (PCA): This approximates a high-dimensional dataset with a low-dimensional subspace.
  • Regression and Classification: Regression predicts output values using labeled training data, while Classification groups the results into classes. Classification has dependent variables that are categorical or unordered whilst Regression has dependent variables that are continuous and ordered:
    • Linear Regression Models (linear regression, logistic regression, and support vector machines): Linear regression algorithms can be expressed as convex optimization problems that aim to minimize an objective function based on a vector of weight variables. The objective function controls the complexity of the model through the regularized part of the function and the error of the model through the loss part of the function.
    • Naive Bayes: This makes predictions based on the conditional probability distribution of a label given an observation. It assumes that features are mutually independent of each other.
    • Decision Trees: This performs recursive binary partitioning of the feature space. The information gain at the tree node level is maximized in order to determine the best split for the partition.
    • Ensembles of trees (Random Forests and Gradient-Boosted Trees): Tree ensemble algorithms combine base decision tree models in order to build a performant model. They are intuitive and very successful for classification and regression tasks.
  • Isotonic Regression: This fits a non-decreasing function to the data, minimizing the mean squared error between the fitted values and the observed responses.
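As an example of the algorithms above, here is a minimal K-Means sketch using the RDD-based MLlib API; the app name and the four 2-D points are invented for illustration:

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="KMeansSketch")  # hypothetical app name

    # Two visually obvious groups of 2-D points (illustrative values)
    points = sc.parallelize([
        [0.0, 0.0], [0.5, 0.5],
        [9.0, 9.0], [9.5, 9.5],
    ])

    # Partition the data into K = 2 distinct clusters
    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)
    print(model.predict([0.25, 0.25]))  # index of the nearest cluster center

The other algorithms in the list follow the same general pattern: train a model from an RDD of points or labeled points, then query the resulting model.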

Additional learning algorithms

Spark MLlib offers more algorithms than the supervised and unsupervised learning ones. There are broadly three additional types of machine learning methods: recommender systems, optimization algorithms, and feature extraction.

[Figure: Additional learning algorithms]

The following additional MLlib algorithms are currently available in Spark:

  • Collaborative filtering: This is the basis for recommender systems. It creates a user-item association matrix and aims to fill the gaps. Based on other users and items, along with their ratings, it recommends an item that the target user has no ratings for. In distributed computing, one of the most successful algorithms is ALS (short for Alternating Least Squares), illustrated in the sketch after this list:
    • Alternating Least Squares: This matrix factorization technique incorporates implicit feedback, temporal effects, and confidence levels. It decomposes the large user-item matrix into lower-dimensional user and item factors. It minimizes a quadratic loss function by alternately fixing one of its factors.
  • Feature extraction and transformation: These are essential techniques for large text document processing. They include the following:
    • Term Frequency: Search engines use TF-IDF to score and rank document relevance in a vast corpus. It is also used in machine learning to determine the importance of a word in a document or corpus. Term frequency statistically determines the weight of a term relative to its frequency in the corpus. Term frequency on its own can be misleading, as it overemphasizes words such as the, of, or and that carry little information. Inverse Document Frequency provides the specificity, or the amount of information a term carries, depending on whether the term is rare or common across all documents in the corpus.
    • Word2Vec: This includes two models, Skip-Gram and Continuous Bag of Words. Skip-Gram predicts neighboring words given a word, based on sliding windows of words, while Continuous Bag of Words predicts the current word given its neighboring words.
    • Standard Scaler: As part of preprocessing, the dataset must often be standardized by mean removal and variance scaling. We compute the mean and standard deviation on the training data and apply the same transformation to the test data.
    • Normalizer: We scale the samples to have unit norm. It is useful for quadratic forms such as the dot product or kernel methods.
    • Feature selection: This reduces the dimensionality of the vector space by selecting the most relevant features for the model.
    • Chi-Square Selector: This uses the chi-squared statistical test of independence to select the features most relevant to the label.
  • Optimization: These specific Spark MLlib optimization algorithms focus on variants of gradient descent. Spark provides a very efficient implementation of gradient descent on a distributed cluster of machines. It looks for a local minimum by iteratively stepping in the direction of steepest descent. It is compute-intensive, as it iterates through all the data available:
    • Stochastic Gradient Descent: We minimize an objective function that is the sum of differentiable functions. Stochastic Gradient Descent uses only a sample of the training data in order to update a parameter in a particular iteration. It is used for large-scale and sparse machine learning problems such as text classification.
    • Limited-memory BFGS (L-BFGS): As the name suggests, L-BFGS uses limited memory and is well suited to the distributed optimization algorithm implementation of Spark MLlib.
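To illustrate the collaborative filtering entry above, the following is a minimal ALS sketch using the RDD-based MLlib API; the app name and the (user, item, rating) triples are invented for illustration:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="ALSSketch")  # hypothetical app name

    # (user, item, rating) triples; the values are made up
    ratings = sc.parallelize([
        Rating(1, 10, 4.0),
        Rating(1, 20, 1.0),
        Rating(2, 10, 5.0),
        Rating(2, 30, 3.0),
    ])

    # Factorize the user-item matrix into rank-2 user and item factors
    model = ALS.train(ratings, rank=2, iterations=10)

    # Predict the rating user 1 would give item 30, which has no observed rating
    print(model.predict(1, 30))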