OCR pipeline with Spark

Image processing and computer vision are two classical yet still-evolving research areas that make heavy use of many types of machine learning algorithms. In several use cases, the relationships that link patterns of image pixels to higher-level concepts are extremely complex, hard to define, and, of course, computationally expensive to learn.

From a practical point of view, it is relatively easy for a human being to recognize whether an object is a face, a dog, or a letter or character. However, defining these patterns under certain circumstances is difficult for a machine. Additionally, image-related datasets are often noisy.

In this section, we will develop a model similar to those used at the core of the Optical Character Recognition (OCR) software bundled with document scanners. Such software helps process paper-based documents by converting printed or handwritten text into an electronic form that can be saved in a database.

When OCR software first processes a document, it divides the page into a grid such that each cell contains a single glyph (a distinct graphical shape), which is just an elaborate way of referring to a letter, symbol, number, or any other piece of contextual information from the page.

To demonstrate the OCR pipeline, we will assume that the document contains only alphabetic characters in English, so that each glyph matches one of the 26 letters, A to Z. We will use the OCR letter dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml), donated by W. Frey and D. J. Slate. Exploring the dataset, we find that it contains 20,000 examples of the 26 English capital letters, printed using 20 different randomly reshaped and distorted black-and-white fonts, so the glyphs come in many different shapes.

Tip

For more information about these data, refer to Letter Recognition Using Holland-Style Adaptive Classifiers by W. Frey and D. J. Slate, Machine Learning, Vol. 6, pp. 161-182 (1991).

The image shown in Figure 19 was published by Frey and Slate and provides an example of some of the printed glyphs. Distorted in this way, the letters are challenging for a computer to identify, yet are easily recognized by a human being. The statistical attributes for the top 20 rows are shown in Figure 20:


Figure 19: Some of the printed glyphs [courtesy of the article titled Letter recognition using Holland-style adaptive classifiers, Machine Learning, Vol. 6, pp. 161-182, by W. Frey and D.J. Slate (1991)]

Exploring and preparing the data

According to the documentation provided by Frey and Slate, when the glyphs are scanned into the computer by an OCR reader, they are automatically converted into pixels, and 16 statistical attributes are recorded for each glyph.

Note that the concentration of black pixels across the various areas of the box containing the character should provide a way to differentiate among the 26 letters of the alphabet, whether for an OCR engine or for a machine learning algorithm trained on these attributes.

Tip

To follow along with this example, download the letterdata.data file from the Packt Publishing website and save it to the input directory of your project.

Once the data has been read from the Spark working directory, we should confirm that it contains the 16 features that define each example of the letter class. As expected, the letter column has 26 levels, as shown in Figure 20:


Figure 20: A snapshot of the dataset shown as a DataFrame
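A quick way to confirm this yourself is to print the schema and count the distinct letter values. The following is only a small sketch, reusing the SparkSession from Step 2 and the CSV load from Step 3 below; it assumes the first column is named letter in the file header:

// Load the CSV as in Step 3 below
Dataset<Row> df = spark.read().format("com.databricks.spark.csv")
    .option("header", "true").load("input/letterdata.data");

// One label column plus 16 feature columns are expected
df.printSchema();

// The letter column should contain exactly 26 distinct values, A to Z
System.out.println("Distinct letters: " + df.select("letter").distinct().count());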

Recall that SVM, the Naive Bayes classifier, and most other classification algorithms, along with their associated learners, require all the features to be numeric and, moreover, each feature to be scaled to a fairly small interval.

Also, SVM works well on densely vectorized features and, consequently, will perform poorly against sparsely vectorized features. In our case, every feature value is already an integer, so we do not need to convert any values into numbers. On the other hand, the ranges of some of these integer variables appear fairly wide.

In practice, this means we might need to normalize the data across all of the feature columns.
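If you do decide to rescale the features yourself, Spark MLlib ships a StandardScaler that can do it. The following is only a minimal sketch, assuming you already have the dataRDD of labeled points built in Step 5 later in this section; note that in Step 7 we instead let LogisticRegressionWithLBFGS scale the features internally via setFeatureScaling, so this manual step is optional:

// Additional imports needed for this sketch
import org.apache.spark.mllib.feature.StandardScaler;
import org.apache.spark.mllib.feature.StandardScalerModel;

// Fit a scaler (zero mean, unit variance) on the feature vectors only
JavaRDD<Vector> featureVectors = dataRDD.map(lp -> lp.features());
StandardScalerModel scaler = new StandardScaler(true, true).fit(featureVectors.rdd());

// Apply the scaler to each labeled point, keeping the label unchanged
JavaRDD<LabeledPoint> scaledRDD = dataRDD.map(
    lp -> new LabeledPoint(lp.label(), scaler.transform(lp.features())));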

OCR pipeline with Spark ML and Spark MLlib

Because of its accuracy and robustness, let's see whether SVM is up to the task. As you can see in Figure 20, we have a multiclass OCR dataset (with 26 classes, to be more precise); therefore, we need a multiclass classification algorithm, such as the logistic regression model, since the current implementation of linear SVM in Spark MLlib does not support multiclass classification.
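As a side note, if you would still prefer an SVM, newer Spark releases (2.2 and later) include a LinearSVC estimator in the spark.ml package that can be wrapped in a OneVsRest meta-classifier to cover all 26 classes. The following is only a rough sketch, assuming a hypothetical trainingDF with a numeric label column and a spark.ml Vector features column (not shown here):

import org.apache.spark.ml.classification.LinearSVC;
import org.apache.spark.ml.classification.OneVsRest;
import org.apache.spark.ml.classification.OneVsRestModel;

// Binary linear SVM used as the base classifier
LinearSVC svc = new LinearSVC().setMaxIter(100).setRegParam(0.01);

// One-vs-rest wrapper turns the binary SVM into a 26-class classifier
OneVsRest ovr = new OneVsRest().setClassifier(svc);
OneVsRestModel ovrModel = ovr.fit(trainingDF);  // trainingDF: assumed label/features DataFrame

In this chapter, however, we will stay with the multiclass logistic regression available in Spark MLlib.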

Step 1: Import necessary packages/libraries/APIs

The following is the code to import the necessary packages:

import java.util.HashMap; 
import java.util.Map; 
import org.apache.spark.api.java.JavaRDD; 
import org.apache.spark.api.java.function.Function; 
import org.apache.spark.mllib.classification.LogisticRegressionModel; 
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS; 
import org.apache.spark.mllib.evaluation.MulticlassMetrics; 
import org.apache.spark.mllib.linalg.DenseVector; 
import org.apache.spark.mllib.linalg.Vector; 
import org.apache.spark.mllib.regression.LabeledPoint; 
import org.apache.spark.sql.Dataset; 
import org.apache.spark.sql.Row; 
import org.apache.spark.sql.SparkSession; 
import scala.Tuple2; 

Step 2: Initialize necessary Spark environment

The following is the code to initialize a Spark environment:

  static SparkSession spark = SparkSession 
        .builder() 
        .appName("OCRPrediction") 
        .master("local[*]") 
        .config("spark.sql.warehouse.dir", "E:/Exp/") 
        .getOrCreate(); 

Here, we set the application name as OCRPrediction and the master URL as local[*], which uses all available cores on the local machine. The Spark session is the entry point of the program. Please set these parameters according to your environment.

Step 3: Read the data file, create a corresponding Dataset, and show the first 20 rows

The following is the code to read the data file:

String input = "input/letterdata.data"; 
Dataset<Row> df = spark.read().format("com.databricks.spark.csv") 
    .option("header", "true") 
    .load(input); 
df.show(); 

For the first 20 rows, please refer to Figure 20. As we can see, the labels are presented as single characters (one of 26 letters) that need to be predicted; therefore, we need to assign each character a double value so that the label lines up with the other numeric features. That is what we will do in the next step.

Step 4: Create a dictionary that assigns each character a double value

The following code creates the dictionary, mapping each character, A to Z, to a value from 0 to 25:

// Map each capital letter (A to Z) to an integer code (0 to 25)
final Map<String, Integer> alpha = new HashMap<>(); 
int count = 0; 
for (char i = 'A'; i <= 'Z'; i++) { 
  alpha.put(i + "", count++); 
  // Printing inside the loop shows the mapping as it grows (see Figure 21)
  System.out.println(alpha); 
} 

And here is the mapping output generated from the preceding code segment:


Figure 21: Mapping assignments

Step 5: Create the labeled points and feature vectors

Create the labeled points and feature vectors from the 16 features (that is, 16 columns), save them as a Java RDD, dump or cache them to disk or memory, and show the sample output:

JavaRDD<LabeledPoint> dataRDD = df.toJavaRDD().map(new Function<Row, LabeledPoint>() { 
      @Override 
      public LabeledPoint call(Row row) throws Exception { 
        // The first column holds the letter; look up its numeric label
        String letter = row.getString(0); 
        double label = alpha.get(letter); 
        // The remaining 16 columns hold the statistical attributes
        double[] features = new double[row.size() - 1]; 
        for (int i = 1; i < row.size(); i++) { 
          features[i - 1] = Double.parseDouble(row.getString(i)); 
        } 
        Vector v = new DenseVector(features); 
        return new LabeledPoint(label, v); 
      } 
    }); 
     
dataRDD.saveAsTextFile("Output/dataRDD"); 
System.out.println(dataRDD.collect()); 

If you look carefully at the preceding code segment, you can see that we created an array named features to hold all 16 feature values and then built a dense vector representation from it; since nearly all of a glyph's attribute values are non-zero, the dense representation is the more compact choice here. Its contents can be seen in the following screenshot:


Figure 22: The Java RDD for the corresponding label and features as vectors
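To make the dense-versus-sparse distinction concrete, here is a small standalone snippet (not part of the pipeline, and using made-up values) showing the two MLlib representations of a 16-element feature row:

import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

// Dense: stores all 16 values explicitly -- a good fit here, since almost
// every statistical attribute of a glyph is non-zero
Vector dense = Vectors.dense(2.0, 8.0, 3.0, 5.0, 1.0, 8.0, 13.0, 0.0,
    6.0, 6.0, 10.0, 8.0, 0.0, 8.0, 0.0, 8.0);

// Sparse: stores only the non-zero entries as (index, value) pairs --
// more compact only when most of the entries are zero
Vector sparse = Vectors.sparse(16, new int[]{0, 1, 2}, new double[]{2.0, 8.0, 3.0});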

Step 6: Generate the training and test sets

Here is the code for generating the training and test sets:

JavaRDD<LabeledPoint>[] splits = dataRDD.randomSplit(new double[] {0.7, 0.3}, 12345L); 
JavaRDD<LabeledPoint> training = splits[0]; 
JavaRDD<LabeledPoint> test = splits[1];  

If you wish to see snapshots of the training or test datasets, you should dump or cache them. The following is sample code for that:

training.saveAsTextFile("Output/training"); 
test.saveAsTextFile("Output/test"); 

Tip

We have randomly split the data into training and test sets for the model to be trained and tested. In our case, the split was 70% and 30%, respectively, with 12345L as the long seed. Readjust these values for your own dataset. Note that if you fix the seed, you get the same split each time you run the code, which makes your experiments reproducible.
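To verify that the split really came out at roughly 70/30, you can simply count both RDDs:

// Quick sanity check: with 20,000 rows, expect roughly 14,000 / 6,000
System.out.println("Training examples: " + training.count()); 
System.out.println("Test examples: " + test.count()); 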

Step 7: Train the model

As you can see, we have a multiclass dataset with 26 classes; therefore, we need a multiclass classification algorithm, for example, the logistic regression model:

boolean useFeatureScaling = true; 
final LogisticRegressionModel model = new LogisticRegressionWithLBFGS() 
    .setNumClasses(26) 
    .setFeatureScaling(useFeatureScaling) 
    .run(training.rdd()); 

The preceding code segment builds the model from the training dataset by specifying the number of classes (that is, 26) and enabling feature scaling. As you can see, we pass the RDD version of the training dataset using training.rdd(), since the run() method expects a Scala RDD of labeled points rather than a Java RDD.

Tip

Spark supports multiclass logistic regression trained with the Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm. In numerical optimization, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm is an iterative method for solving unconstrained nonlinear optimization problems; L-BFGS is its memory-efficient variant.
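For readers who want slightly more detail, the core BFGS iteration (of which L-BFGS keeps only a limited history) can be sketched as follows; this is standard optimization textbook material rather than anything Spark-specific:

x(k+1) = x(k) - alpha(k) * H(k) * gradient f(x(k))

H(k+1) = (I - rho(k) s(k) y(k)^T) H(k) (I - rho(k) y(k) s(k)^T) + rho(k) s(k) s(k)^T

Here, s(k) = x(k+1) - x(k), y(k) = gradient f(x(k+1)) - gradient f(x(k)), and rho(k) = 1 / (y(k)^T s(k)). L-BFGS never stores the inverse-Hessian approximation H(k) explicitly; it rebuilds the product H(k) * gradient f(x(k)) from the last few (s, y) pairs, which keeps memory usage low even when the model has many features.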

Step 8: Compute the raw scores on the test dataset

Here is the code to compute the raw scores:

JavaRDD<Tuple2<Object, Object>> predictionAndLabels = test.map( 
    new Function<LabeledPoint, Tuple2<Object, Object>>() { 
      public Tuple2<Object, Object> call(LabeledPoint p) { 
        Double prediction = model.predict(p.features()); 
        return new Tuple2<Object, Object>(prediction, p.label()); 
      } 
    }); 
predictionAndLabels.saveAsTextFile("output/prd2"); 

If you look at the preceding code carefully, you will see that we are actually computing the predicted labels from the model we created in Step 7 and pairing each prediction with the true label in a Java RDD of tuples.

Step 9: Predict the outcome for label 8.0 (that is, I) and get the evaluation metrics

The following code illustrates how to compute the evaluation metrics and look up the rates for label 8.0:

MulticlassMetrics metrics = new MulticlassMetrics(predictionAndLabels.rdd()); 
System.out.println(metrics.confusionMatrix()); 

// Precision and recall for the first label in the metrics' label list
double precision = metrics.precision(metrics.labels()[0]); 
double recall = metrics.recall(metrics.labels()[0]); 

// True/false positive rates for label 8.0 (that is, the letter I)
double tp = 8.0; 
double TP = metrics.truePositiveRate(tp); 
double FP = metrics.falsePositiveRate(tp); 
double WTP = metrics.weightedTruePositiveRate(); 
double WFP = metrics.weightedFalsePositiveRate(); 
System.out.println("Precision = " + precision); 
System.out.println("Recall = " + recall); 
System.out.println("True Positive Rate = " + TP); 
System.out.println("False Positive Rate = " + FP); 
System.out.println("Weighted True Positive Rate = " + WTP); 
System.out.println("Weighted False Positive Rate = " + WFP); 

Figure 23: Performance metrics for precision and recall

Therefore, the precision is about 75%, which is obviously not satisfactory. If you are unsatisfied with this result, the next chapter looks at how to tune the parameters so that the prediction accuracy increases.
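Besides the per-label precision and recall printed above, MulticlassMetrics also exposes overall and weighted summaries across all 26 classes. Here is a short optional follow-up, reusing the metrics object from Step 9 (and assuming Spark 2.0 or later for the accuracy() method):

// Overall fraction of test glyphs that were classified correctly
System.out.println("Accuracy = " + metrics.accuracy()); 

// Precision, recall, and F-measure averaged over all 26 labels,
// weighted by how often each label occurs in the test set
System.out.println("Weighted Precision = " + metrics.weightedPrecision()); 
System.out.println("Weighted Recall = " + metrics.weightedRecall()); 
System.out.println("Weighted F-measure = " + metrics.weightedFMeasure()); 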

Tip

To get an idea of how to calculate precision, recall, the true positive rate, and the true negative rate, refer to the Wikipedia page at https://en.wikipedia.org/wiki/Sensitivity_and_specificity, which discusses sensitivity and specificity in detail. You can also refer to Powers, David M. W. (2011), Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation, Journal of Machine Learning Technologies, 2(1): 37-63.
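For quick reference, the two measures reported earlier are defined in terms of true positives (TP), false positives (FP), and false negatives (FN) as follows:

precision = TP / (TP + FP)

recall (sensitivity, or the true positive rate) = TP / (TP + FN)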
