The generalization of ML models

In Chapter 5, Supervised and Unsupervised Learning by Example, we discussed how and why to generalize learning algorithms to fit semi-supervised learning, active learning, structured prediction, and reinforcement learning. In this section, we will show how to generalize the linear regression algorithm, using the Optical Character Recognition (OCR) dataset as an example of generalizing the linear regression model.

Generalized linear regression

As discussed in Chapter 5, Supervised and Unsupervised Learning by Example, the linear regression technique assumes that the output follows a Gaussian distribution. Generalized linear models (GLMs), on the other hand, are specifications of linear models in which the response variable, that is, Yi, follows some distribution from the exponential family of distributions.
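Concretely, a GLM relates the expected value of the response to the linear predictor through a link function g:

```latex
g\left(\mathbb{E}[y_i]\right) = \mathbf{x}_i^{\top}\boldsymbol{\beta}
```

With the Gaussian family and the identity link, this reduces to ordinary linear regression, and with the Binomial family and the logit link it becomes logistic regression.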

Spark's GeneralizedLinearRegression API allows flexible specification of GLMs. The current implementation of generalized linear regression in Spark can be used for numerous types of prediction problems, for example, linear regression, Poisson regression, logistic regression, and others.
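As a sketch of this flexibility, other members of the exponential family can be selected simply by changing the family and link parameters. The combinations below are illustrative, not a complete list; consult the API documentation for the valid pairings:

```java
import org.apache.spark.ml.regression.GeneralizedLinearRegression;

public class GlmFamilies {
    public static void main(String[] args) {
        // Poisson regression for count-valued responses, with the canonical log link
        GeneralizedLinearRegression poisson = new GeneralizedLinearRegression()
            .setFamily("poisson")
            .setLink("log");

        // Binomial family with the logit link, that is, logistic regression
        GeneralizedLinearRegression logistic = new GeneralizedLinearRegression()
            .setFamily("binomial")
            .setLink("logit");

        // Gamma regression for positive continuous responses
        GeneralizedLinearRegression gamma = new GeneralizedLinearRegression()
            .setFamily("gamma")
            .setLink("inverse");

        System.out.println(poisson.getFamily() + "/" + poisson.getLink());
        System.out.println(logistic.getFamily() + "/" + logistic.getLink());
        System.out.println(gamma.getFamily() + "/" + gamma.getLink());
    }
}
```

Each estimator can then be fitted with fit() on a Dataset<Row> exactly as in the Gaussian example below.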

However, the limitation is that only a subset of the exponential family distributions is supported by the current implementation of the Spark-based GLM algorithm. In addition, there is a scalability limitation: only 4096 features are supported through the GeneralizedLinearRegression API. Consequently, if the number of features exceeds 4096, the algorithm will throw an exception.

Fortunately, models with a larger number of features can be trained using the LinearRegression and LogisticRegression estimators, as shown in several examples in Chapter 6, Building Scalable Machine Learning Pipelines.
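A minimal sketch of that fallback follows; the data path is hypothetical, and the hyperparameters mirror the GLM example below. Since LinearRegression does not share the 4096-feature ceiling, the wide model can be trained with essentially the same call pattern:

```java
import org.apache.spark.ml.regression.LinearRegression;
import org.apache.spark.ml.regression.LinearRegressionModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WideLinearRegression {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("WideLinearRegressionExample")
            .master("local[*]")
            .getOrCreate();

        // A hypothetical wide dataset (more than 4096 features) in libsvm format
        Dataset<Row> wide = spark.read().format("libsvm")
            .load("input/wide_features.data");

        // LinearRegression handles feature vectors wider than the GLM limit
        LinearRegression lr = new LinearRegression()
            .setMaxIter(10)
            .setRegParam(0.3);
        LinearRegressionModel model = lr.fit(wide);

        System.out.println("Number of coefficients: " + model.coefficients().size());
        spark.stop();
    }
}
```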

Generalized linear regression with Spark

In this sub-section, we present a step-by-step example that shows how to apply generalized linear regression to the libsvm version of the Optical Character Recognition (OCR) data that we discussed in Chapter 7, Tuning Machine Learning Models. Since the same dataset is reused here, we will not describe it further.

Step 1: Load the necessary API and packages

Here is the code to load the necessary API and packages:

import java.util.Arrays; 
import org.apache.spark.ml.regression.GeneralizedLinearRegression; 
import org.apache.spark.ml.regression.GeneralizedLinearRegressionModel; 
import org.apache.spark.ml.regression.GeneralizedLinearRegressionTrainingSummary; 
import org.apache.spark.sql.Dataset; 
import org.apache.spark.sql.Row; 
import org.apache.spark.sql.SparkSession; 

Step 2: Create the Spark session

The following code shows how to create the Spark session:

SparkSession spark = SparkSession 
    .builder() 
    .appName("JavaGeneralizedLinearRegressionExample") 
    .master("local[*]") 
    .config("spark.sql.warehouse.dir", "E:/Exp/") 
    .getOrCreate();  

Step 3: Load and create the Dataset

Load and create the dataset from the OCR data file. Here we specify the dataset format as libsvm:

Dataset<Row> dataset = spark.read().format("libsvm").load("input/Letterdata_libsvm.data"); 

Step 4: Prepare the training and test sets

The following code illustrates how to prepare the training and test sets:

double[] weights = {0.8, 0.2}; 
long seed = 12345L; 
Dataset<Row>[] split = dataset.randomSplit(weights, seed); 
Dataset<Row> training = split[0]; 
Dataset<Row> test = split[1]; 

Step 5: Create a generalized linear regression estimator

Create the generalized linear regression estimator by specifying the family, link, maximum number of iterations, and regularization parameter. Here we have selected "gaussian" as the family and "identity" as the link:

GeneralizedLinearRegression glr = new GeneralizedLinearRegression() 
    .setFamily("gaussian") 
    .setLink("identity") 
    .setMaxIter(10) 
    .setRegParam(0.3); 

Note that, according to the API documentation at http://spark.apache.org/docs/latest/ml-classification-regression.html#generalized-linear-regression, the following families are supported by the current Spark implementation of this algorithm:


Figure 2: Available supported families with the current implementation of Generalized Linear Regression

Step 6: Fit the model

Here is the code to fit the model:

GeneralizedLinearRegressionModel model = glr.fit(training); 

Step 7: Check the coefficients and intercept

Print the coefficients and intercept for the generalized linear regression model that we trained in Step 6:

System.out.println("Coefficients: " + model.coefficients()); 
System.out.println("Intercept: " + model.intercept()); 

Output for these two parameters will be similar to the following:

Coefficients: [-0.0022864381796305487,-0.002728958263362158,0.001582003618682323,-0.0027708788253722914,0.0021962329827476565,-0.014769839282003813,0.027752802299957722,0.005757124632688538,0.013869444611365267,-0.010555326094498824,-0.006062727351948948,-0.01618167221020619,0.02894330366681715,-0.006180003317929849,-0.0025768386348180294,0.015161831324693125,0.8125261496082304] 
Intercept: 1.2140016821111255  

Note that the System.out.println method will not work in cluster mode; it works only in standalone or pseudo-distributed mode. It is used here only to verify the result.

Step 8: Summarize the model

Summarize the model over the training set and print out some metrics:

GeneralizedLinearRegressionTrainingSummary summary = model.summary(); 

Step 9: Verify some generalized metrics

Let's print some generalized metrics, such as the coefficient standard errors, t-values, p-values, dispersion, null deviance, residual degrees of freedom for the null model, AIC, and deviance residuals. Owing to page limitations, we do not discuss the significance of these values or how they are calculated:

System.out.println("Coefficient Standard Errors: " 
      + Arrays.toString(summary.coefficientStandardErrors())); 
System.out.println("T Values: " + Arrays.toString(summary.tValues())); 
System.out.println("P Values: " + Arrays.toString(summary.pValues())); 
System.out.println("Dispersion: " + summary.dispersion()); 
System.out.println("Null Deviance: " + summary.nullDeviance()); 
System.out.println("Residual Degree Of Freedom Null: " + summary.residualDegreeOfFreedomNull()); 
System.out.println("Deviance: " + summary.deviance()); 
System.out.println("Residual Degree Of Freedom: " + summary.residualDegreeOfFreedom()); 
System.out.println("AIC: " + summary.aic()); 

Let's see the values for the training set we created previously:

Coefficient Standard Errors:[2.877963555951775E-4, 0.0016618949921257992, 9.147115254397696E-4, 0.001633197607413805, 0.0013194682048354774, 0.001427648472211677, 0.0010797461071614422, 0.001092731825368789, 7.922778963434026E-4, 9.413717346009722E-4, 8.746375698587989E-4, 9.768068714323967E-4, 0.0010276211138097238, 0.0011457739746946476, 0.0015025626835648176, 9.048329671989396E-4, 0.0013145697411570455, 0.02274018067510297] 
T Values:[-7.944639100457261, -1.6420762300218703, 1.729510971146599, -1.6965974067032972, 1.6644834446931607, -10.345571455081481, 25.703081600282317, 5.2685613240426585, 17.50578259898057, -11.212707697212734, -6.931702411237277, -16.56588695621814, 28.165345454527458, -5.3937368577226055, -1.714962485760994, 16.756497468951743, 618.0928437414578, 53.385753589911985] 
P Values:[1.9984014443252818E-15, 0.10059394323065063, 0.08373705354670546, 0.0897923347927514, 0.09603552109755675, 0.0, 0.0, 1.3928712139232857E-7, 0.0, 0.0, 4.317657342767234E-12, 0.0, 0.0, 6.999167956323049E-8, 0.08637155105770145, 0.0, 0.0, 0.0] 
Dispersion: 0.07102433332236015  
Null Deviance: 41357.85510971454 
Residual Degree Of Freedom Null: 15949 
Deviance: 1131.5596784918419 
Residual Degree Of Freedom: 15932 
AIC: 3100.6418768238423  
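For reference, the AIC reported above trades off goodness of fit against model size; with k estimated parameters and maximized likelihood L-hat, it is

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

so a lower AIC indicates a better balance of fit and parsimony when comparing models on the same data.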

Step 10: Show the deviance residuals

The following code is used to show the deviance residuals:

summary.residuals().show(); 

Figure 3: Summary of the deviance residuals for the OCR dataset
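Although the steps above stop at the training summary, the fitted model can also be applied to the held-out test set created in Step 4. A brief sketch, continuing from the variables defined earlier (the "label", "features", and "prediction" column names are the Spark defaults):

```java
// Apply the trained model to the test split created in Step 4;
// transform() appends a "prediction" column alongside "label" and "features"
Dataset<Row> predictions = model.transform(test);
predictions.select("label", "prediction").show(5);
```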

Tip

Interested readers should refer to the following web page to get more information on and insight into this algorithm and its implementation details: http://spark.apache.org/docs/latest/ml-classification-regression.html
