Adapting through incremental algorithms

According to Robi Polikar et al. (Learn++: An Incremental Learning Algorithm for Supervised Neural Networks, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 21, No. 4, November 2001), various algorithms have been proposed for incremental learning, and the term is consequently used to mean different things. In some of the literature, incremental learning refers to growing or pruning a classifier. Alternatively, it may refer to selecting the most informative training samples so that a problem can be solved incrementally.

In other cases, making a regular ML algorithm incremental means performing some form of controlled modification of the classifier's weights, typically by retraining it on the misclassified samples. Some algorithms are able to learn new information; however, they do not simultaneously satisfy all of the previously mentioned criteria. Moreover, they either require access to the old data or forget prior knowledge along the way, and because they cannot accommodate new classes, they are not adaptable to new datasets.

Considering the previously mentioned issues, in this section we will discuss how to adapt ML models using incremental versions of the original algorithms. Incremental SVMs, Bayesian networks, and neural networks will be discussed in brief. Moreover, when applicable, we will provide the regular Spark implementation of these algorithms.

Incremental support vector machine

It's pretty difficult to make a regular ML algorithm incremental. In short, it's possible, but not altogether easy. If you want to do it, you have to change the underlying source code of the Spark library you are using or implement the training algorithm yourself.

Unfortunately, Spark does not have an incremental version of SVM implemented. However, before making the linear SVM incremental, you first need to understand the linear SVM itself. Therefore, the next sub-section introduces linear SVMs using Spark on a new dataset.

Tip

To our knowledge, there are only two implementations, SVMHeavy (http://people.eng.unimelb.edu.au/shiltona/svm/) and LaSVM (http://leon.bottou.org/projects/lasvm), that support incremental training. However, we haven't used either. Interested readers should follow these two papers on incremental SVMs to get some insight; they are straightforward and a good starting point:

http://cbcl.mit.edu/cbcl/publications/ps/cauwenberghs-nips00.pdf.

http://www.jmlr.org/papers/volume7/laskov06a/laskov06a.pdf.

Adapting SVMs for new data with Spark

In this section, we will first discuss how to perform binary classification using the linear SVM implementation of Spark. Then we will show how to adapt the same algorithm for a new type of data.

Step 1: Data collection and exploration

We have collected a colon cancer dataset from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html. Originally, the dataset was labeled with -1.0 and 1.0, as follows:


Figure 4: Original colon cancer data snapshot

Tip

The dataset was used in the following publication: U. Alon, N. Barkai, D. A. Notterman, K. Gish, S. Ybarra, D. Mack, and A. J. Levine. Broad patterns of gene expression revealed by clustering analysis of tumour and normal colon tissues probed by oligonucleotide arrays. Cell Biology, 96:6745-6750, 1999. Interested readers should refer to the publication to get more insights into the dataset.

After that, instance-wise normalization was carried out to zero mean and unit variance, followed by feature-wise normalization to zero mean and unit variance, as a pre-processing step. However, for simplicity, we have mapped -1.0 to 0.0, since SVM does not recognize the sign symbols (that is, + or -). Therefore, the dataset now contains the two labels 1 and 0 (that is to say, it's a binary classification problem). After pre-processing and scaling, there are two classes and 2,000 features. Here is a sample of the dataset in Figure 5:


Figure 5: Pre-processed colon cancer data

Step 2: Load the necessary packages and APIs

Here is the code to load the necessary packages:

import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.mllib.classification.SVMModel; 
import org.apache.spark.mllib.classification.SVMWithSGD; 
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics; 
import org.apache.spark.mllib.evaluation.MulticlassMetrics; 
import org.apache.spark.mllib.optimization.L1Updater; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.mllib.util.MLUtils; 
import org.apache.spark.sql.SparkSession; 
import scala.Tuple2; 

Step 3: Configure the Spark session

The following code helps us to create the Spark session:

SparkSession spark = SparkSession 
    .builder() 
    .appName("JavaLDAExample") 
    .master("local[*]") 
    .config("spark.sql.warehouse.dir", "E:/Exp/") 
    .getOrCreate(); 

Step 4: Create an RDD out of the data

Here is the code to create a JavaRDD of LabeledPoint from the input file:

String path = "input/colon-cancer.data"; 
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(spark.sparkContext(), path).toJavaRDD(); 

Step 5: Prepare the training and test sets

Here is the code to prepare the training and test sets:

JavaRDD<LabeledPoint> training = data.sample(false, 0.8, 11L); 
training.cache(); 
JavaRDD<LabeledPoint> test = data.subtract(training); 

Step 6: Build and train the SVM model

The following code illustrates how to build and train the SVM model: 

int numIterations = 500; 
final SVMModel model = SVMWithSGD.train(training.rdd(), numIterations); 

Step 7: Compute the raw prediction score on the test set

Here is the code to compute the raw prediction:

JavaRDD<Tuple2<Object, Object>> scoreAndLabels = test.map( 
    new Function<LabeledPoint, Tuple2<Object, Object>>() { 
      public Tuple2<Object, Object> call(LabeledPoint p) { 
        Double score = model.predict(p.features()); 
        return new Tuple2<Object, Object>(score, p.label()); 
      }}); 

Step 8: Evaluate the model

Here is the code to evaluate the model:

BinaryClassificationMetrics metrics = new BinaryClassificationMetrics(JavaRDD.toRDD(scoreAndLabels)); 
System.out.println("Area Under PR = " + metrics.areaUnderPR()); 
System.out.println("Area Under ROC = " + metrics.areaUnderROC()); 
Area Under PR = 0.6266666666666666 
Area Under ROC = 0.875  

Note that the area under the ROC curve ranges between 0.5 and 1.0. A value above 0.8 indicates a good classifier, whereas a value below 0.8 signals a poor one. The SVMWithSGD.train() method performs L2 regularization by default, with the regularization parameter set to 1.0.

If you want to configure this algorithm, you should customize SVMWithSGD further by creating a new object directly. After that, you can use the object's setter methods to set the parameter values.

Interestingly, all the other Spark MLlib algorithms can be customized in this way. However, if you want to make changes at the API level, you need to modify and build the Spark source code itself. Interested readers can join the Apache Spark mailing list if they want to contribute to the open source project.

Note that the source code of Spark is available as open source on GitHub at https://github.com/apache/spark, where you can send pull requests to enrich Spark. More technical discussion can be found on the Spark website at http://spark.apache.org/.

For example, the following code produces an L1-regularized variant of the SVM with the regularization parameter set to 0.1, and runs the training algorithm for 500 iterations:

SVMWithSGD svmAlg = new SVMWithSGD(); 
svmAlg.optimizer() 
      .setNumIterations(500) 
      .setRegParam(0.1) 
      .setUpdater(new L1Updater()); 
final SVMModel model = svmAlg.run(training.rdd()); 

Your model is now trained. Now if you perform step 7 and step 8, the following metrics will be generated:

Area Under PR = 0.9380952380952381 
Area Under ROC = 0.95 

If you compare this result with the result produced in step 8, it's much better now, isn't it? However, depending on the data preparation, you might experience different results.

It indicates better classification (see also https://www.researchgate.net/post/What_is_the_value_of_the_area_under_the_roc_curve_AUC_to_conclude_that_a_classifier_is_excellent). In this way, the SVM can be optimized for, or adapted to, a new data type.

However, the parameters (that is, the number of iterations, the regularization parameter, and the updater) should be set accordingly.
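
Although Spark's SVMWithSGD has no incremental mode, one way to approximate adapting an already trained model to a new chunk of data is to warm-start a fresh training run with the previous model's weights, so that prior knowledge is not simply discarded. The following is a minimal sketch only, not an official incremental SVM: the new data path is hypothetical, it assumes the new chunk lives in the same feature space as the original training data, and it reuses the objects and imports from the earlier steps plus org.apache.spark.mllib.linalg.Vector:

import org.apache.spark.mllib.linalg.Vector; 

// Hypothetical new chunk of labeled data in the same libsvm format and feature space 
JavaRDD<LabeledPoint> newChunk = MLUtils.loadLibSVMFile(spark.sparkContext(), 
    "input/colon-cancer-new.data").toJavaRDD(); 

// Configure a fresh trainer; fewer iterations are usually enough when warm-starting 
SVMWithSGD incrementalAlg = new SVMWithSGD(); 
incrementalAlg.optimizer() 
      .setNumIterations(100) 
      .setRegParam(0.1) 
      .setUpdater(new L1Updater()); 

// Warm-start from the weights of the previously trained model 
Vector previousWeights = model.weights(); 
final SVMModel updatedModel = incrementalAlg.run(newChunk.rdd(), previousWeights); 

Alternatively, as discussed at the beginning of this section, you could first filter the new chunk down to the misclassified samples and retrain only on those. Either way, this is a warm-started batch retraining rather than a true incremental SVM.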

Incremental neural networks

The incremental version of a neural network in R or MATLAB provides adaptability through the adapt function. But does it update the existing model iteratively instead of overwriting it? To verify this, readers can try the R or MATLAB version of an incremental neural network-based classifier: select a subset of your first data chunk and use it as the second training chunk. If the network is being overwritten, then when you use the net trained on that subset to test the whole of the first data chunk, it will likely predict poorly on the data that does not belong to the subset.

Multilayer perceptron classification with Spark

To date, there is no implementation of an incremental version of a neural network in Spark. According to the API documentation provided at https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier, Spark's Multilayer Perceptron Classifier (MLPC) is a classifier based on the Feedforward Artificial Neural Network (FANN). An MLPC consists of multiple layers of nodes, including hidden layers, where each layer is fully connected to the next one in the network. Nodes in the input layer represent the input data. All other nodes map inputs to outputs by a linear combination of the inputs with the node's weights w and bias b, followed by an activation function. The number of nodes N in the output layer corresponds to the number of classes.
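
According to the same documentation, nodes in the intermediate layers use the sigmoid (logistic) function, while nodes in the output layer use the softmax function:

$$f(z_i) = \frac{1}{1 + e^{-z_i}} \quad \text{(intermediate layers)}, \qquad f(z_i) = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}} \quad \text{(output layer)}$$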

MLPC also performs backpropagation for learning the model. Spark uses the logistic loss function for optimization and Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) as the optimization routine. Note that L-BFGS is an optimization algorithm in the family of Quasi-Newton Methods (QNM) that approximates the Broyden-Fletcher-Goldfarb-Shanno algorithm using a limited amount of memory. To train the multilayer perceptron classifier, the following parameters need to be set:

  • Layers
  • Convergence tolerance of iterations
  • Block size of the learning
  • Seed
  • Maximum number of iterations

Note that the layers consist of the input, hidden, and output layers. Moreover, a smaller value of the convergence tolerance will lead to higher accuracy at the cost of more iterations. The default block size parameter is 128, and the maximum number of iterations is set to 100 by default. We suggest you set these values accordingly and carefully.

In this sub-section, we will show how Spark implements neural network learning through the multilayer perceptron classifier on the Iris dataset.

Step 1: Dataset collection, processing, and exploration

The original Iris plant dataset was collected from the UCI machine learning repository (http://www.ics.uci.edu/~mlearn/MLRepository.html), then pre-processed and scaled to the libsvm format by Chang et al. and placed in LIBSVM, a library for support vector machines, at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html for binary, multi-class, and multi-label classification tasks. The Iris dataset contains three classes and four features, where the sepal and petal measurements are scaled according to the libsvm format. More specifically, here is the attribute information:

  • Class: Iris Setosa, Iris Versicolour, Iris Virginica (column 1)
  • Sepal length in cm (column 2)
  • Sepal width in cm (column 3)
  • Petal length in cm (column 4)
  • Petal width in cm (column 5)

A snapshot of the dataset is shown in Figure 6:


Figure 6: Iris dataset snapshot

Step 2: Load the required packages and APIs

Here is the code to load the required packages and APIs:

import org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel; 
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier; 
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator; 
import org.apache.spark.sql.Dataset; 
import org.apache.spark.sql.Row; 
import org.apache.spark.sql.SparkSession; 
import com.example.SparkSession.UtilityForSparkSession; 

Step 3: Create a Spark session

The following code helps us to create the Spark session:

SparkSession spark = UtilityForSparkSession.mySession(); 

Note, the mySession() method that creates and returns a Spark session object is as follows:

public static SparkSession mySession() { 
    SparkSession spark = SparkSession.builder() 
        .appName("MultilayerPerceptronClassificationModel") 
        .master("local[*]") 
        .config("spark.sql.warehouse.dir", "E:/Exp/") 
        .getOrCreate(); 
    return spark; 
} 

Step 4: Parse and prepare the dataset

Load the input data in the libsvm format:

String path = "input/iris.data"; 
Dataset<Row> dataFrame = spark.read().format("libsvm").load(path); 

Step 5: Prepare the training and test set

Prepare the training and test set: training = 70%, test = 30%, and seed = 12345L:

Dataset<Row>[] splits = dataFrame.randomSplit(new double[] { 0.7, 0.3 }, 12345L); 
Dataset<Row> train = splits[0]; 
Dataset<Row> test = splits[1]; 

Step 6: Specify the layers for the neural network

Specify the layers for the neural network. Here, the input layer has size 4 (features), there are two intermediate (that is, hidden) layers of sizes 4 and 3, and the output layer has size 3 (classes):

int[] layers = new int[] { 4, 4, 3, 3 }; 

Step 7: Create the multilayer perceptron estimator

Create the MultilayerPerceptronClassifier trainer and set its parameters. Here, set the layers from step 6 using the setLayers() method. Set the convergence tolerance of iterations using the setTol() method; a smaller value will lead to higher accuracy at the cost of more iterations.

Note that the default tolerance is 1E-4. Set the block size using the setBlockSize() method, where the default is 128. Set the seed for weight initialization if the initial weights are not set using setInitialWeights(). Finally, set the maximum number of iterations using the setMaxIter() method, where the default is 100:

MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier() 
        .setLayers(layers)        
        .setTol(1E-4)         
        .setBlockSize(128)         
        .setSeed(12345L)  
        .setMaxIter(100); 

Step 8: Train the model

Train the MultilayerPerceptronClassificationModel using the preceding estimator from step 7:

MultilayerPerceptronClassificationModel model = trainer.fit(train); 

Step 9: Compute the accuracy on the test set

Here is the code to compute the accuracy on the test set:

Dataset<Row> result = model.transform(test); 
Dataset<Row> predictionAndLabels = result.select("prediction", "label"); 

Step 10: Evaluate the model

Evaluate the model, calculate the metrics, and print the accuracy, weighted precision, and weighted recall:

MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator().setMetricName("accuracy"); 
MulticlassClassificationEvaluator evaluator2 = new MulticlassClassificationEvaluator().setMetricName("weightedPrecision"); 
MulticlassClassificationEvaluator evaluator3 = new MulticlassClassificationEvaluator().setMetricName("weightedRecall"); 
System.out.println("Accuracy = " + evaluator.evaluate(predictionAndLabels)); 
System.out.println("Precision = " + evaluator2.evaluate(predictionAndLabels)); 
System.out.println("Recall = " + evaluator3.evaluate(predictionAndLabels)); 

The output should appear as follows:

Accuracy = 0.9545454545454546  
Precision = 0.9595959595959596 
Recall = 0.9545454545454546  

Step 11: Stop the Spark session

The following code is used to stop the Spark session:

spark.stop(); 

From the preceding prediction metrics, it is clear that the classification performance is quite impressive. Now it's your turn to make your model adaptable: try training and testing it with a new dataset.

Incremental Bayesian networks

As we discussed earlier, Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features. A Naive Bayes-based model can be trained very efficiently. The model computes the conditional probability distribution of each feature given the label in a single pass over the training data. After that, it applies Bayes' theorem to compute the conditional probability distribution of the labels in order to make predictions.
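
Concretely, for a feature vector $x = (x_1, \ldots, x_n)$ and a class $C_k$, the model scores each class as

$$P(C_k \mid x) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$

and predicts the class with the highest score.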

However, there is still no implementation of an incremental version of the Bayesian network in Spark. According to the API documentation provided at http://spark.apache.org/docs/latest/mllib-naive-bayes.html, for document classification each observation is a document and each feature represents a term. The value of a feature is the frequency of the term (for multinomial Naive Bayes), or a zero or a one indicating whether the term was found in the document (for Bernoulli Naive Bayes).

Note that, as with linear SVM-based learning, the feature values must be non-negative here too. The model type is selected with an optional parameter, multinomial or Bernoulli, with multinomial as the default. Furthermore, additive smoothing can be used by setting the parameter λ (lambda); its default value is 1.0.
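
Although Spark does not ship an incremental Naive Bayes, the algorithm lends itself naturally to incremental learning: multinomial Naive Bayes training only accumulates per-class counts, and such sufficient statistics can be updated as new labeled data arrives without revisiting the old data. The following standalone Java class is a minimal, hypothetical sketch of that idea (it is not part of Spark MLlib):

import java.util.HashMap; 
import java.util.Map; 

public class IncrementalMultinomialNB { 
    private final Map<Integer, double[]> featureCountsPerClass = new HashMap<>(); 
    private final Map<Integer, Long> docCountsPerClass = new HashMap<>(); 
    private final int numFeatures; 
    private final double lambda;  // additive (Laplace) smoothing, as in Spark's NaiveBayes 

    public IncrementalMultinomialNB(int numFeatures, double lambda) { 
        this.numFeatures = numFeatures; 
        this.lambda = lambda; 
    } 

    // Incorporate one more labeled observation; calling this repeatedly over time 
    // is the incremental equivalent of a single batch pass over all the data. 
    public void update(int label, double[] termFrequencies) { 
        double[] counts = featureCountsPerClass.computeIfAbsent(label, k -> new double[numFeatures]); 
        for (int i = 0; i < numFeatures; i++) { 
            counts[i] += termFrequencies[i]; 
        } 
        docCountsPerClass.merge(label, 1L, Long::sum); 
    } 

    // Smoothed log-likelihood log P(term i | class), recomputed from the current counts 
    public double logLikelihood(int label, int i) { 
        double[] counts = featureCountsPerClass.get(label); 
        double total = 0.0; 
        for (double c : counts) { 
            total += c; 
        } 
        return Math.log((counts[i] + lambda) / (total + lambda * numFeatures)); 
    } 
} 

The class priors P(class) can likewise be maintained from docCountsPerClass, so a pass over the old data is never needed.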

More technical details on the big data approach to Bayesian network-based learning can be found in the paper A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning by Jianwu W. et al. (http://users.sdsc.edu/~jianwu/JianwuWang_files/A_Scalable_Data_Science_Workflow_Approach_for_Big_Data_Bayesian_Network_Learning.pdf).

Interested readers should also refer to the literature on incremental Bayesian networks for more insight.

Classification using Naive Bayes with Spark

The current implementation in Spark MLlib supports both multinomial Naive Bayes and Bernoulli Naive Bayes. However, an incremental version has not been implemented yet. Therefore, in this section, we will show you how to perform classification using the Spark MLlib version of Naive Bayes on the RCV1 dataset, to give you some idea of Naive Bayes-based learning.

Note that, due to the low accuracy and precision obtained using Spark ML, we did not provide the Pipeline version, but implemented the example using only Spark MLlib. Moreover, if you have better, more suitable data, you can implement the Spark ML version with ease.

Step 1: Data collection, pre-processing, and exploration

The dataset was downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#aloi and was provided by David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li, RCV1: A new benchmark collection for text categorization research, Journal of Machine Learning Research, 5:361-397, 2004.

Pre-processing: For the pre-processing, two steps were considered as follows:

  • The label hierarchy is reorganized by mapping the dataset to the second level of the RCV1 topic hierarchy. Documents that only have labels at the third or fourth level are mapped to their parent categories at the second level. Documents that only have labels at the first level are not considered when creating the mapping.
  • Multi-labeled instances were removed, since the current implementation of the multi-label classifier in Spark is not robust enough.

After performing these two steps, there are 53 classes and 47,236 features. Here is a snapshot of the dataset, shown in Figure 7:


Figure 7: RCV1 topic hierarchy dataset

Step 2: Load the required library and packages

Here is the code to load the library and packages:

import org.apache.spark.api.java.JavaPairRDD; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.api.java.function.PairFunction; 
import org.apache.spark.mllib.classification.NaiveBayes; 
import org.apache.spark.mllib.classification.NaiveBayesModel; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.mllib.util.MLUtils; 
import org.apache.spark.sql.SparkSession; 
import scala.Tuple2; 

Step 3: Initiate a Spark session

The following code helps us to create the Spark session:

static SparkSession spark = SparkSession 
      .builder() 
      .appName("JavaLDAExample").master("local[*]") 
      .config("spark.sql.warehouse.dir", "E:/Exp/") 
      .getOrCreate();  

Step 4: Prepare LabeledPoint RDDs

Parse the dataset in the libsvm format and prepare LabeledPoint RDDs:

static String path = "input/rcv1_train.multiclass.data"; 
JavaRDD<LabeledPoint> inputData = MLUtils.loadLibSVMFile(spark.sparkContext(), path).toJavaRDD();  

For document classification, the input feature vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of sparsity. Since the training data is only used once, it is not necessary to cache it.
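
If you construct the feature vectors yourself instead of loading a libsvm file, you can create sparse vectors explicitly. This is a minimal sketch with made-up indices, values, and label, assuming the standard MLlib Vectors factory:

import org.apache.spark.mllib.linalg.Vector; 
import org.apache.spark.mllib.linalg.Vectors; 

// A sparse vector of length 47,236 with only three non-zero term frequencies 
Vector doc = Vectors.sparse(47236, new int[]{5, 120, 46000}, new double[]{2.0, 1.0, 3.0}); 
LabeledPoint labeledDoc = new LabeledPoint(4.0, doc);  // hypothetical label 4.0 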

Step 5: Prepare the training and test set

Here is the code to prepare the training and test set:

JavaRDD<LabeledPoint>[] split = inputData.randomSplit(new double[]{0.8, 0.2}, 12345L); 
JavaRDD<LabeledPoint> training = split[0];  
JavaRDD<LabeledPoint> test = split[1]; 

Step 6: Train the Naive Bayes model

Train a Naive Bayes model by specifying the model type as multinomial and lambda = 1.0, which is the default and is suitable for multiclass classification. Note, however, that Bernoulli Naive Bayes requires 0 or 1 feature values:

final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0, "multinomial"); 

Step 7: Calculate the prediction on the test dataset

Here is the code to calculate the prediction:

JavaPairRDD<Double, Double> predictionAndLabel = 
    test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { 
      @Override 
      public Tuple2<Double, Double> call(LabeledPoint p) { 
        return new Tuple2<>(model.predict(p.features()), p.label()); 
      } 
    }); 

Step 8: Calculate the prediction accuracy

Here is the code to calculate the prediction accuracy:

double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { 
      @Override 
      public Boolean call(Tuple2<Double, Double> pl) { 
        return pl._1().equals(pl._2()); 
      } 
    }).count() / (double) test.count(); 

Step 9: Print the accuracy

Here is the code to print the accuracy:

System.out.println("Accuracy of the classification: "+accuracy); 

This provides the following output:

Accuracy of the classification: 0.5941753719531497  

This is pretty low, right? As we discussed when tuning ML models in Chapter 7, Tuning Machine Learning Models, there are further opportunities to improve the prediction accuracy, for example by selecting an appropriate algorithm (that is, classifier or regressor) and tuning it via cross-validation and train-validation split.
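
As a minimal sketch of that tuning idea (assuming the DataFrame-based Spark ML API, the same input path as above, and labels already indexed from zero), the smoothing parameter of Naive Bayes could be selected by cross-validation as follows:

import org.apache.spark.ml.classification.NaiveBayes; 
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator; 
import org.apache.spark.ml.param.ParamMap; 
import org.apache.spark.ml.tuning.CrossValidator; 
import org.apache.spark.ml.tuning.CrossValidatorModel; 
import org.apache.spark.ml.tuning.ParamGridBuilder; 
import org.apache.spark.sql.Dataset; 
import org.apache.spark.sql.Row; 

Dataset<Row> df = spark.read().format("libsvm").load("input/rcv1_train.multiclass.data"); 

NaiveBayes nb = new NaiveBayes();  // DataFrame-based Naive Bayes estimator 
ParamMap[] paramGrid = new ParamGridBuilder() 
    .addGrid(nb.smoothing(), new double[]{0.1, 1.0, 10.0})  // candidate lambda values 
    .build(); 

CrossValidator cv = new CrossValidator() 
    .setEstimator(nb) 
    .setEvaluator(new MulticlassClassificationEvaluator().setMetricName("accuracy")) 
    .setEstimatorParamMaps(paramGrid) 
    .setNumFolds(3); 

CrossValidatorModel cvModel = cv.fit(df);  // picks the best smoothing value by 3-fold CV 

Note that the Spark ML classifiers expect class labels indexed from 0 to numClasses - 1, so the raw RCV1 labels may need re-indexing (for example, with StringIndexer) before this sketch will run.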
