ML libraries

Spark, particularly with memory-based storage systems, claims to substantially improve the speed of data access within and between nodes. ML seems to be a natural fit, as many algorithms require multiple passes over the data, or repartitioning. MLlib is the open source library of choice, although private companies are catching, up with their own proprietary implementations.

As I will chow in Chapter 5, Regression and Classification, most of the standard machine learning algorithms can be expressed as an optimization problem. For example, classical linear regression minimizes the sum of squares of y distance between the regression line and the actual value of y:

ML libraries

Here, ML libraries are the predicted values according to the linear expression:

ML libraries

A is commonly called the slope, and B the intercept. In a more generalized formulation, a linear optimization problem is to minimize an additive function:

ML libraries

Here, ML libraries is a loss function and ML libraries is a regularization function. The regularization function is an increasing function of model complexity, for example, the number of parameters (or a natural logarithm thereof). Most common loss functions are given in the following table:

 

Loss function L

Gradient

Linear

ML libraries
ML libraries

Logistic

ML libraries
ML libraries

Hinge

ML libraries
ML libraries

The purpose of the regularizer is to penalize more complex models to avoid overfitting and improve generalization error: more MLlib currently supports the following regularizers:

 

Regularizer R

Gradient

L2

ML libraries
ML libraries

L1

ML libraries
ML libraries

Elastic net

ML libraries
ML libraries

Here, sign(w) is the vector of the signs of all entries of w.

Currently, MLlib includes implementation of the following algorithms:

  • Basic statistics:
    • Summary statistics
    • Correlations
    • Stratified sampling
    • Hypothesis testing
    • Streaming significance testing
    • Random data generation
  • Classification and regression:
    • Linear models (SVMs, logistic regression, and linear regression)
    • Naive Bayes
    • Decision trees
    • Ensembles of trees (Random Forests and Gradient-Boosted Trees)
    • Isotonic regression
  • Collaborative filtering:
    • Alternating least squares (ALS)
  • Clustering:
    • k-means
    • Gaussian mixture
    • Power Iteration Clustering (PIC)
    • Latent Dirichlet allocation (LDA)
    • Bisecting k-means
    • Streaming k-means
  • Dimensionality reduction:
    • Singular Value Decomposition (SVD)
    • Principal Component Analysis (PCA)
  • Feature extraction and transformation
  • Frequent pattern mining:
    • FP-growth?Association rules
    • PrefixSpan
  • Optimization:
    • Stochastic Gradient Descent (SGD)
    • Limited-Memory BFGS (L-BFGS)

I will go over some of the algorithms in Chapter 5, Regression and Classification. More complex non-structured machine learning methods will be considered in Chapter 6, Working with Unstructured Data.

SparkR

R is an implementation of popular S programming language created by John Chambers while working at Bell Labs. R is currently supported by the R Foundation for Statistical Computing. R's popularity has increased in recent years according to polls. SparkR provides a lightweight frontend to use Apache Spark from R. Starting with Spark 1.6.0, SparkR provides a distributed DataFrame implementation that supports operations such as selection, filtering, aggregation, and so on, which is similar to R DataFrames, dplyr, but on very large datasets. SparkR also supports distributed machine learning using MLlib.

SparkR required R version 3 or higher, and can be invoked via the ./bin/sparkR shell. I will cover SparkR in Chapter 8, Integrating Scala with R and Python.

Graph algorithms – GraphX and GraphFrames

Graph algorithms are one of the hardest to correctly distribute between nodes, unless the graph itself is naturally partitioned, that is, it can be represented by a set of disconnected subgraphs. Since the social networking analysis on a multi-million node scale became popular due to companies such as Facebook, Google, and LinkedIn, researches have been coming up with new approaches to formalize the graph representations, algorithms, and types of questions asked.

GraphX is a modern framework for graph computations described in a 2013 paper (GraphX: A Resilient Distributed Graph System on Spark by Reynold Xin, Joseph Gonzalez, Michael Franklin, and Ion Stoica, GRADES (SIGMOD workshop), 2013). It has graph-parallel frameworks such as Pregel, and PowerGraph as predecessors. The graph is represented by two RDDs: one for vertices and another one for edges. Once the RDDs are joined, GraphX supports either Pregel-like API or MapReduce-like API, where the map function is applied to the node's neighbors and reduce is the aggregation step on top of the map results.

At the time of writing, GraphX includes the implementation for the following graph algorithms:

  • PageRank
  • Connected components
  • Triangle counting
  • Label propagation
  • SVD++ (collaborative filtering)
  • Strongly connected components

As GraphX is an open source library, changes to the list are expected. GraphFrames is a new implementation from Databricks that fully supports the following three languages: Scala, Java, and Python, and is build on top of DataFrames. I'll discuss specific implementations in Chapter 7, Working with Graph Algorithms.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset