Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

ML libraries

Spark, particularly with memory-based storage systems, claims to substantially improve the speed of data access within and between nodes. ML seems to be a natural fit, as many algorithms require multiple passes over the data, or repartitioning. MLlib is the open source library of choice, although private companies are catching, up with their own proprietary implementations.

As I will chow in Chapter 5, Regression and Classification, most of the standard machine learning algorithms can be expressed as an optimization problem. For example, classical linear regression minimizes the sum of squares of y distance between the regression line and the actual value of y:

Here, are the predicted values according to the linear expression:

A is commonly called the slope, and B the intercept. In a more generalized formulation, a linear optimization problem is to minimize an additive function:

Here, is a loss function and is a regularization function. The regularization function is an increasing function of model complexity, for example, the number of parameters (or a natural logarithm thereof). Most common loss functions are given in the following table:

	Loss function L	Gradient
Linear
Logistic
Hinge

The purpose of the regularizer is to penalize more complex models to avoid overfitting and improve generalization error: more MLlib currently supports the following regularizers:

	Regularizer R	Gradient
L2
L1
Elastic net

Here, sign(w) is the vector of the signs of all entries of w.

Currently, MLlib includes implementation of the following algorithms:

Basic statistics:
- Summary statistics
- Correlations
- Stratified sampling
- Hypothesis testing
- Streaming significance testing
- Random data generation
Classification and regression:
- Linear models (SVMs, logistic regression, and linear regression)
- Naive Bayes
- Decision trees
- Ensembles of trees (Random Forests and Gradient-Boosted Trees)
- Isotonic regression
Collaborative filtering:
- Alternating least squares (ALS)
Clustering:
- k-means
- Gaussian mixture
- Power Iteration Clustering (PIC)
- Latent Dirichlet allocation (LDA)
- Bisecting k-means
- Streaming k-means
Dimensionality reduction:
- Singular Value Decomposition (SVD)
- Principal Component Analysis (PCA)
Feature extraction and transformation
Frequent pattern mining:
- FP-growth?Association rules
- PrefixSpan
Optimization:
- Stochastic Gradient Descent (SGD)
- Limited-Memory BFGS (L-BFGS)

I will go over some of the algorithms in Chapter 5, Regression and Classification. More complex non-structured machine learning methods will be considered in Chapter 6, Working with Unstructured Data.

SparkR

R is an implementation of popular S programming language created by John Chambers while working at Bell Labs. R is currently supported by the R Foundation for Statistical Computing. R's popularity has increased in recent years according to polls. SparkR provides a lightweight frontend to use Apache Spark from R. Starting with Spark 1.6.0, SparkR provides a distributed DataFrame implementation that supports operations such as selection, filtering, aggregation, and so on, which is similar to R DataFrames, dplyr, but on very large datasets. SparkR also supports distributed machine learning using MLlib.

SparkR required R version 3 or higher, and can be invoked via the ./bin/sparkR shell. I will cover SparkR in Chapter 8, Integrating Scala with R and Python.

Graph algorithms – GraphX and GraphFrames

Graph algorithms are one of the hardest to correctly distribute between nodes, unless the graph itself is naturally partitioned, that is, it can be represented by a set of disconnected subgraphs. Since the social networking analysis on a multi-million node scale became popular due to companies such as Facebook, Google, and LinkedIn, researches have been coming up with new approaches to formalize the graph representations, algorithms, and types of questions asked.

GraphX is a modern framework for graph computations described in a 2013 paper (GraphX: A Resilient Distributed Graph System on Spark by Reynold Xin, Joseph Gonzalez, Michael Franklin, and Ion Stoica, GRADES (SIGMOD workshop), 2013). It has graph-parallel frameworks such as Pregel, and PowerGraph as predecessors. The graph is represented by two RDDs: one for vertices and another one for edges. Once the RDDs are joined, GraphX supports either Pregel-like API or MapReduce-like API, where the map function is applied to the node's neighbors and reduce is the aggregation step on top of the map results.

At the time of writing, GraphX includes the implementation for the following graph algorithms:

PageRank
Connected components
Triangle counting
Label propagation
SVD++ (collaborative filtering)
Strongly connected components

As GraphX is an open source library, changes to the list are expected. GraphFrames is a new implementation from Databricks that fully supports the following three languages: Scala, Java, and Python, and is build on top of DataFrames. I'll discuss specific implementations in Chapter 7, Working with Graph Algorithms.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for ML libraries

Create new playlist

Sign In

Sign Up

ML libraries

SparkR

Graph algorithms – GraphX and GraphFrames

Table of Contents for
ML libraries