Adapting through reusing ML models

In this section, we will describe how to make a machine learning model adaptable to new datasets. An example will be shown for the prediction of heart disease. First we will describe the problem statement, and then we will explore the heart disease dataset. Following the dataset exploration, we will train a model and save it to local storage. After that, the model will be evaluated to see how it performs. Finally, we will reload and reuse the same trained model on a new dataset.

More specifically, we will show how to predict the possibility of future heart disease by using the Spark machine learning APIs including Spark MLlib, Spark ML, and Spark SQL.

Problem statement and objectives

Machine learning and big data together are a powerful combination that has had a great impact on research in academia and industry, as well as in the biomedical sector. In the area of biomedical data analytics, applying this combination to real datasets improves diagnosis and prognosis, and therefore healthcare. Moreover, life science research is also entering the big data era, since datasets are being generated and produced at an unprecedented rate. This imposes great challenges on machine learning and bioinformatics tools and algorithms, which must extract VALUE from big data characterized by volume, velocity, variety, veracity, visibility, and value.

Data exploration

In recent times, biomedical research has advanced enormously, and more and more life sciences datasets are being generated, many of them openly available. However, for simplicity and ease, we have decided to use the Cleveland database. To date, most researchers who have applied machine learning techniques to biomedical data analytics have used this dataset. According to the dataset description at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names, this heart disease dataset is one of the most used and best-studied datasets by researchers in the biomedical data analytics and machine learning fields.

The dataset is freely available at the UCI machine learning dataset repository at https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/. The data contains a total of 76 attributes; however, most published research papers use a subset of only 14 features. The "goal" field refers to whether heart disease is present or absent. It has five possible values ranging from 0 to 4. The value 0 signifies no presence of heart disease. The values 1 and 2 signify that the disease is present but at an early stage. The values 3 and 4, on the other hand, indicate a strong possibility of heart disease. Biomedical laboratory experiments with the Cleveland dataset have simply attempted to distinguish presence (values 1, 2, 3, and 4) from absence (value 0). In short, the higher the value, the stronger the evidence of the presence of heart disease. Another point is that privacy is an important concern in biomedical data analytics, as in all kinds of diagnosis and prognosis. Therefore, the names and social security numbers of the patients were recently removed from the dataset to avoid privacy issues, and those fields have been replaced with dummy values instead.

It is to be noted that three files have been processed, containing the Cleveland, Hungarian, and Switzerland datasets. All the unprocessed files also exist in this directory. To demonstrate the example, we will use the Cleveland dataset for training and evaluating the models, while the Hungarian dataset will be used to reuse the saved model. As we have said already, although the number of attributes is 76 (including the predicted attribute), like other ML/biomedical researchers, we will use only 14 attributes, with the following attribute information:

No. | Attribute name | Explanation
1 | age | Age in years
2 | sex | Sex (1 = male; 0 = female)
3 | cp | Chest pain type (Value 1: typical angina; Value 2: atypical angina; Value 3: non-anginal pain; Value 4: asymptomatic)
4 | trestbps | Resting blood pressure (in mm Hg on admission to the hospital)
5 | chol | Serum cholesterol in mg/dl
6 | fbs | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7 | restecg | Resting electrocardiographic results (Value 0: normal; Value 1: having ST-T wave abnormality; Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria)
8 | thalach | Maximum heart rate achieved
9 | exang | Exercise-induced angina (1 = yes; 0 = no)
10 | oldpeak | ST depression induced by exercise relative to rest
11 | slope | The slope of the peak exercise ST segment (Value 1: upsloping; Value 2: flat; Value 3: downsloping)
12 | ca | Number of major vessels (0-3) colored by fluoroscopy
13 | thal | Thallium stress test result (Value 3: normal; Value 6: fixed defect; Value 7: reversible defect)
14 | num | Diagnosis of heart disease, angiographic disease status (Value 0: < 50% diameter narrowing; Value 1: > 50% diameter narrowing)

Table 1: Dataset characteristics

A sample snapshot of the dataset is given as follows:


Figure 8: A sample snapshot of the heart disease dataset

Developing a heart disease predictive model

Step 1: Loading the required packages and APIs

The following packages and APIs need to be imported for our purpose. We believe the packages are self-explanatory if you have minimal working experience with Spark 2.0.0:

import java.util.HashMap; 
import java.util.List; 
import org.apache.spark.api.java.JavaPairRDD; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.api.java.function.PairFunction; 
import org.apache.spark.ml.classification.LogisticRegression; 
import org.apache.spark.mllib.classification.LogisticRegressionModel; 
import org.apache.spark.mllib.classification.NaiveBayes; 
import org.apache.spark.mllib.classification.NaiveBayesModel; 
import org.apache.spark.mllib.linalg.DenseVector; 
import org.apache.spark.mllib.linalg.Vector; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.mllib.regression.LinearRegressionModel; 
import org.apache.spark.mllib.regression.LinearRegressionWithSGD; 
import org.apache.spark.mllib.tree.DecisionTree; 
import org.apache.spark.mllib.tree.RandomForest; 
import org.apache.spark.mllib.tree.model.DecisionTreeModel; 
import org.apache.spark.mllib.tree.model.RandomForestModel; 
import org.apache.spark.rdd.RDD; 
import org.apache.spark.sql.Dataset; 
import org.apache.spark.sql.Row; 
import org.apache.spark.sql.SparkSession; 
import com.example.SparkSession.UtilityForSparkSession; 
import scala.Tuple2; 

Step 2: Create an active Spark session

The following code helps us to create the Spark session:

SparkSession spark = UtilityForSparkSession.mySession(); 

Here is the UtilityForSparkSession class that creates and returns an active Spark session:

import org.apache.spark.sql.SparkSession; 
public class UtilityForSparkSession { 
  public static SparkSession mySession() { 
    SparkSession spark = SparkSession 
                          .builder() 
                          .appName("UtilityForSparkSession") 
                          .master("local[*]") 
                          .config("spark.sql.warehouse.dir", "E:/Exp/") 
                          .getOrCreate(); 
    return spark; 
  } 
} 

Note that here, on the Windows 7 platform, we have set the Spark SQL warehouse to E:/Exp/; set your path accordingly, based on your operating system.

Step 3: Data parsing and creation of an RDD of LabeledPoint

Take the input as a simple text file, parse it, and create an RDD of LabeledPoint that will be used for the classification and regression analysis. Also specify the input source and the number of partitions. Adjust the number of partitions based on your dataset size. Here, the number of partitions has been set to 2:

String input = "heart_diseases/processed_cleveland.data"; 
Dataset<Row> my_data = spark.read().format("com.databricks.spark.csv").load(input); 
my_data.show(false); 
RDD<String> linesRDD = spark.sparkContext().textFile(input, 2); 

Since a JavaRDD cannot be created directly from spark.sparkContext().textFile(), we have created a plain RDD first so that it can be converted to a JavaRDD when necessary. Now let's create a JavaRDD of LabeledPoint. To do so, we first convert the RDD to a JavaRDD as follows:

JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() {
  @Override
  public LabeledPoint call(String row) throws Exception {
    // Replace the missing values (marked with '?') with a large dummy value
    String line = row.replaceAll("\\?", "999999.0");
    String[] tokens = line.split(",");
    // The 14th column (index 13) is the label; the first 13 columns are the features
    Integer last = Integer.parseInt(tokens[13]);
    double[] features = new double[13];
    for (int i = 0; i < 13; i++) {
      features[i] = Double.parseDouble(tokens[i]);
    }
    Vector v = new DenseVector(features);
    // Binarize the label: 0 = absence, anything greater than 0 = presence
    Double value = 0.0;
    if (last.intValue() > 0)
      value = 1.0;
    return new LabeledPoint(value, v);
  }
});

Using the replaceAll() method, we have handled invalid values, such as the missing values that are marked with the ? character in the original file. To get rid of the missing or invalid values, we have replaced them with a very large value that has no adverse effect on the original classification or predictive results. The reason for this is that missing or sparse data can lead to highly misleading results.
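If you prefer not to impute with a sentinel value, an alternative (not part of the original example, shown here only as a sketch) is to simply drop the rows that contain missing values before creating the LabeledPoint RDD:

// Hypothetical alternative: filter out rows that contain the '?' marker
// instead of replacing missing values with a large dummy value
JavaRDD<String> cleanLinesRDD = linesRDD.toJavaRDD().filter(new Function<String, Boolean>() {
  @Override
  public Boolean call(String row) throws Exception {
    return !row.contains("?");
  }
});
// cleanLinesRDD can then be mapped to LabeledPoint exactly as shown above.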

Step 4: Splitting the RDD of LabeledPoint into training and test sets

In the previous step, we created an RDD of LabeledPoint data that can be used for the regression or classification task. Now we need to split the data into training and test sets as follows:

double[] weights = {0.7, 0.3}; 
long split_seed = 12345L; 
JavaRDD<LabeledPoint>[] split = data.randomSplit(weights, split_seed); 
JavaRDD<LabeledPoint> training = split[0]; 
JavaRDD<LabeledPoint> test = split[1]; 

If you look at the preceding code segment, you will find that we have split the RDD of LabeledPoint into 70% for the training set and 30% for the test set. The randomSplit() method performs this split. The split seed is a long integer that keeps the split random yet reproducible, so the result does not change from run to run during model building or training. It is also a good idea to persist (cache) these RDDs so that they are not recomputed across operations after they are first computed; note that a storage level can only be assigned to an RDD that does not already have one set.
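As a small optional addition (not in the original listing), you can cache both splits and sanity-check their sizes:

// Optional: cache the splits so they are not recomputed across operations,
// and print their sizes as a quick sanity check
training.cache();
test.cache();
System.out.println("Training set size: " + training.count());
System.out.println("Test set size: " + test.count());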

Step 5: Train the model

First, we will train a linear regression model, which is one of the simplest regression models:

final double stepSize = 0.0000000009; 
final int numberOfIterations = 40;  
LinearRegressionModel model = LinearRegressionWithSGD.train(JavaRDD.toRDD(training), numberOfIterations, stepSize); 

As you can see, the preceding code trains a linear regression model with no regularization using stochastic gradient descent (SGD). This solves the least squares regression formulation f(weights) = (1/n) * ||A * weights - y||^2, that is, the mean squared error. Here, the data matrix A has n rows, and the input RDD holds the rows of A, each with its corresponding right-hand-side label y. To train the model, the method takes the training set, the number of iterations, and the step size. We provide some indicative values for the last two parameters here.
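Once the model is trained, you can inspect the learned weights and intercept; this is a small optional addition, not part of the original listing:

// Inspect the trained linear regression model (optional)
System.out.println("Weights: " + model.weights());
System.out.println("Intercept: " + model.intercept());
// If the weights look degenerate, try different values for stepSize and
// numberOfIterations; both usually require tuning for SGD to converge.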

Step 6: Model saving for future use

Now let's save the model that we just created for future use. It's pretty simple - just use the following code by specifying the storage location as follows:

String model_storage_loc = "models/heartModel";   
model.save(spark.sparkContext(), model_storage_loc); 

Once the model is saved in your desired location, you will see the following output in your Eclipse console:


Figure 9: The log after the model is saved to the storage

Step 7: Evaluate the model with a test set

Now let's calculate the prediction score on the test dataset:

JavaPairRDD<Double, Double> predictionAndLabel =
  test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    @Override
    public Tuple2<Double, Double> call(LabeledPoint p) {
      return new Tuple2<>(model.predict(p.features()), p.label());
    }
  });

Now compute the accuracy of the prediction:

double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { 
          @Override 
          public Boolean call(Tuple2<Double, Double> pl) { 
            return pl._1().equals(pl._2()); 
          } 
        }).count() / (double) test.count(); 
System.out.println("Accuracy of the classification: "+accuracy);   

The output appears as follows:

Accuracy of the classification: 0.0 
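As an aside (not part of the original example), a regression model is usually judged by its mean squared error rather than exact-match accuracy, since its predictions are continuous values. A minimal sketch, assuming a DoubleFunction import from org.apache.spark.api.java.function:

// Mean squared error of the linear regression predictions on the test set
double testMSE = predictionAndLabel.mapToDouble(
    new DoubleFunction<Tuple2<Double, Double>>() {
      @Override
      public double call(Tuple2<Double, Double> pl) {
        double diff = pl._1() - pl._2();
        return diff * diff;
      }
    }).mean();
System.out.println("Test Mean Squared Error: " + testMSE);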

Step 8: Predictive analytics using a different classifier

Unfortunately, there is no prediction accuracy at all, right? Keep in mind that a linear regression model outputs continuous values, so an exact-match comparison against the 0/1 labels will almost never succeed. Beyond that, there might be several reasons for the poor result, including the following:

  • The dataset characteristics
  • Model selection
  • Parameter selection, also called hyperparameter tuning

For simplicity, we assume the dataset is fine since, as already mentioned, it is widely used for machine learning research by many researchers around the globe. Now, what next? Let's consider another classification algorithm, for example, a decision tree or Random Forest classifier. What about a Random Forest? Let's go with the Random Forest classifier next. Just use the following code to train the model on the training set:

Integer numClasses = 2; // Number of classes (the label is binary: disease present or absent)

Now declare a HashMap for the categorical features information; leaving it empty means all features are treated as continuous:

HashMap<Integer, Integer> categoricalFeaturesInfo = new HashMap<Integer, Integer>(); 

Now declare the other parameters needed to train the Random Forest classifier:

Integer numTrees = 5; // Use more in practice
String featureSubsetStrategy = "auto"; // Let the algorithm choose the best strategy
String impurity = "gini"; // entropy is also available for classification
Integer maxDepth = 20; // Set the maximum tree depth accordingly
Integer maxBins = 40; // Set the number of bins accordingly
Integer seed = 12345; // Random seed for reproducibility
final RandomForestModel model = RandomForest.trainClassifier(training, numClasses,
    categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);

We believe the parameters used by the trainClassifier() method are self-explanatory, so we leave it to the reader to explore the significance of each parameter. Fantastic! We have trained the model using the Random Forest classifier, and we will also save this model for future use before reusing it in Step 9 (see the sketch at the end of this step). Now, if you reuse the same code that we described in the Evaluate the model with a test set step, you should have the following output:

Accuracy of the classification: 0.7843137254901961  

Now the predictive accuracy should be much better. If you are still not satisfied, you can try another classifier, such as the Naive Bayes classifier, and carry out the hyperparameter tuning discussed in Chapter 7, Tuning Machine Learning Models.
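Since Step 9 reloads a RandomForestModel from the storage location, the Random Forest model itself needs to be saved. Here is a minimal sketch, assuming the same model_storage_loc as in Step 6 and that no previously saved model exists at that path (otherwise use a fresh path or delete the old one first):

// Save the trained Random Forest model so that it can be reloaded in Step 9
// (model_storage_loc is the same location declared in Step 6)
model.save(spark.sparkContext(), model_storage_loc);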

Step 9: Making the model adaptable for a new dataset

We already mentioned that we saved the model for future use; now we should take the opportunity to use the same model on new datasets. The reason, if you recall the steps, is that we trained the model using the training set and evaluated it using the test set. Now, if you have more data or new data available, what will you do? Will you re-train the model? Of course not, since you would have to iterate over several steps and sacrifice valuable time and cost.

Therefore, it would be wise to use the already trained model and evaluate its performance on a new dataset. Well, now let's reuse the stored model. Note that you have to load the saved model with the same type of model that was used to train it. For example, if you trained and saved the model using the Random Forest classifier, you have to load it back as a Random Forest model. Therefore, we will use RandomForestModel to load the model while working with the new dataset. Use the following code to do that. First, create an RDD of LabeledPoint from the new dataset (that is, the Hungarian database with the same 14 attributes):

String new_data = "heart_diseases/processed_hungarian.data";
RDD<String> linesRDD = spark.sparkContext().textFile(new_data, 2);
JavaRDD<LabeledPoint> data = linesRDD.toJavaRDD().map(new Function<String, LabeledPoint>() {
  @Override
  public LabeledPoint call(String row) throws Exception {
    // Handle the missing values and binarize the label, exactly as in Step 3
    String line = row.replaceAll("\\?", "999999.0");
    String[] tokens = line.split(",");
    Integer last = Integer.parseInt(tokens[13]);
    double[] features = new double[13];
    for (int i = 0; i < 13; i++) {
      features[i] = Double.parseDouble(tokens[i]);
    }
    Vector v = new DenseVector(features);
    Double value = 0.0;
    if (last.intValue() > 0)
      value = 1.0;
    return new LabeledPoint(value, v);
  }
});

Now let's load the saved model using the Random forest model algorithm as follows:

RandomForestModel model2 = RandomForestModel.load(spark.sparkContext(), model_storage_loc);

Now let's compute the predictions on the new dataset:

JavaPairRDD<Double, Double> predictionAndLabel = 
  data.mapToPair(new PairFunction<LabeledPoint, Double, Double>() { 
          @Override 
          public Tuple2<Double, Double> call(LabeledPoint p) { 
      return new Tuple2<>(model2.predict(p.features()), p.label()); 
            } 
          }); 

Now calculate the accuracy of the prediction as follows:

double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() { 
          @Override 
          public Boolean call(Tuple2<Double, Double> pl) { 
            return pl._1().equals(pl._2()); 
          } 
        }).count() / (double) data.count(); 
System.out.println("Accuracy of the classification: "+accuracy);   

We should have the following output:

Accuracy of the classification: 0.9108910891089109 

Now train the Naive Bayes classifier and observe its predictive performance; a minimal training sketch is shown below. Alternatively, just download the source code for the Naive Bayes classifier and run it as a Maven project using the pom.xml file, which includes all the required JAR and API dependencies.
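For reference, here is a minimal sketch of how the Naive Bayes classifier could be trained and evaluated on the same training/test split, using the NaiveBayes and NaiveBayesModel classes already imported in Step 1 (the smoothing parameter value of 1.0 is an assumption, not taken from the original source):

// Train a Naive Bayes classifier on the same training set (lambda = 1.0 smoothing)
final NaiveBayesModel nbModel = NaiveBayes.train(training.rdd(), 1.0);

// Evaluate it on the test set using the same exact-match accuracy measure
double nbAccuracy = test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
      @Override
      public Tuple2<Double, Double> call(LabeledPoint p) {
        return new Tuple2<>(nbModel.predict(p.features()), p.label());
      }
    }).filter(new Function<Tuple2<Double, Double>, Boolean>() {
      @Override
      public Boolean call(Tuple2<Double, Double> pl) {
        return pl._1().equals(pl._2());
      }
    }).count() / (double) test.count();
System.out.println("Accuracy of the Naive Bayes classification: " + nbAccuracy);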

The following table shows a comparison of the predictive accuracies of the three classifiers (that is, linear regression, Random Forest, and Naive Bayes). Note that, depending on the training run, the model you get might produce different output, since we randomly split the dataset into training and test sets:

Classifier | Model building time | Model saving time | Accuracy
Linear regression | 1199 ms | 2563 ms | 0.0%
Naïve Bayes | 873 ms | 2514 ms | 45%
Random forest | 2120 ms | 2538 ms | 91%

Table 2: Comparison between three classifiers

Note

We obtained the preceding output on a machine with Windows 7 (64-bit), a Core i7 (2.90 GHz) processor, and 32 GB of main memory. Therefore, depending on your OS and hardware configuration, you might receive different results.

In this way, an ML model can be made adaptable to new data. However, make sure that you use the same type of classifier or regressor for both training and reuse when making your ML application adaptable.
