Let's go back to our earlier discussion of the two options: building a machine learning/NLP model on Hadoop versus scoring an ML model on Hadoop. We covered the second option, scoring, in depth in the last section. Now, instead of sampling a smaller dataset and scoring it, let's use the larger dataset and build a large-scale machine learning model step by step using PySpark. I am again using the same running data with the same schema:
ID | Comment | Class
---|---------|------
UA0001 | I tried calling you, The service was not up to the mark | 1
UA0002 | Can you please update my phone no | 0
UA0003 | Really bad experience | 1
UA0004 | I am looking for an iPhone | 0
UA0005 | Can somebody help me with my password | 1
UA0006 | Thanks for considering my request for | 0
Consider this schema holding the last 10 years' worth of comments for the organization. Now, instead of using a small sample to build a classification model and then scoring all the comments with that pretrained model, let me give you a step-by-step example of how to build a text classification model on the full data using PySpark.
The first thing we need to do is import some modules, starting with SparkContext. SparkContext is essentially the configuration entry point; you can provide more parameters to it, such as the app name and others.
>>>from pyspark import SparkContext
>>>sc = SparkContext(appName="comment_classification")
For more information, go through the article at
http://spark.apache.org/docs/0.7.3/api/pyspark/pyspark.context.SparkContext-class.html.
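If you want to pass more than just the app name, one common approach is to build a SparkConf first and hand it to the SparkContext. The following is only a minimal sketch; the master setting and executor memory value are assumptions you would adjust for your own cluster:
>>>from pyspark import SparkConf, SparkContext
>>>conf = SparkConf().setAppName("comment_classification").setMaster("yarn-client")  # assumption: running against YARN
>>>conf = conf.set("spark.executor.memory", "4g")  # assumption: tune to your cluster
>>>sc = SparkContext(conf=conf)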
The next thing is reading a tab-delimited text file. The file should be read from HDFS, since it could be huge (on the order of terabytes or petabytes):
>>>lines = sc.textFile("testcomments.txt")
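Since the file is expected to sit on HDFS, you would typically point textFile at a full HDFS URI rather than a local path. The namenode host and path below are placeholders, not real values from this example:
>>>lines = sc.textFile("hdfs://namenode:8020/user/analytics/testcomments.txt")  # placeholder HDFS path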
The lines RDD now holds all the rows of the corpus:
>>>from pyspark.sql import Row
>>>parts = lines.map(lambda l: l.split("\t"))
>>>corpus = parts.map(lambda row: Row(id=row[0], comment=row[1], label=row[2]))
Each element of parts is a list of fields, since every line is split on the tab delimiter. Note that the class column is stored under the name label in the Row, because class is a reserved word in Python.
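As a quick sanity check (not part of the original walkthrough), you can pull the first parsed record and confirm the three fields line up with the schema:
>>>parts.first()       # expect something like ['UA0001', 'I tried calling you, The service was not up to the mark', '1']
>>>corpus.first().label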
Let's break the corpus, which has [ID, comment, class (0, 1)], into separate RDD objects:
>>>comment = corpus.map(lambda row: row.comment)
>>>class_var = corpus.map(lambda row: float(row.label))  # cast to float for MLlib labels
Once we have the comments, we need to follow a process very similar to the one in Chapter 6, Text Classification, where we used scikit-learn for tokenization and hashing vectorization, and calculated TF, IDF, and tf-idf with a vectorizer.
The following snippet shows how to perform tokenization and compute term frequency and inverse document frequency:
>>>from pyspark.mllib.feature import HashingTF
>>>from pyspark.mllib.feature import IDF  # https://spark.apache.org/docs/1.2.0/mllib-feature-extraction.html
>>>comment_tokenized = comment.map(lambda line: line.strip().split(" "))
>>>hashingTF = HashingTF(1000)  # to select only 1000 features
>>>comment_tf = hashingTF.transform(comment_tokenized)
>>>comment_idf = IDF().fit(comment_tf)
>>>comment_tfidf = comment_idf.transform(comment_tf)
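At this point every comment has been hashed into a 1,000-dimensional vector. A quick way to spot-check the result (again, an extra check that is not in the original snippet):
>>>print(comment_tf.first())      # SparseVector of size 1000 with raw term counts
>>>print(comment_tfidf.first())   # SparseVector of size 1000 with tf-idf weights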
We will merge the class variable with the tfidf RDD like this:
>>>finaldata = class_var.zip(comment_tfidf)
We will do a typical train/test split:
>>>train, test = finaldata.randomSplit([0.8, 0.2], seed=0)
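Because MLlib makes multiple passes over the training data, it can help to cache both splits in memory before training. This is a hedged suggestion rather than a step from the original walkthrough:
>>>train.cache()
>>>test.cache()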
Let's perform the main classification steps, which are quite similar to scikit-learn. We could use logistic regression, which is a widely used classifier; pyspark.mllib provides you with a variety of algorithms. For more information on pyspark.mllib, visit https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html
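Since logistic regression was mentioned above, here is a minimal sketch of how it could look with pyspark.mllib; the variable names and the use of LogisticRegressionWithSGD with default parameters are my assumptions, not part of the original example:
>>>from pyspark.mllib.regression import LabeledPoint
>>>from pyspark.mllib.classification import LogisticRegressionWithSGD
>>>lr_train = train.map(lambda t: LabeledPoint(t[0], t[1]))   # (label, tf-idf vector) pairs
>>>lr_model = LogisticRegressionWithSGD.train(lr_train)
>>>lr_output = test.map(lambda t: (lr_model.predict(t[1]), t[0]))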
The following is an example of a Naive Bayes classifier:
>>>from pyspark.mllib.regression import LabeledPoint
>>>from pyspark.mllib.classification import NaiveBayes
>>>train_rdd = train.map(lambda t: LabeledPoint(t[0], t[1]))
>>>test_rdd = test.map(lambda t: LabeledPoint(t[0], t[1]))
>>>nb = NaiveBayes.train(train_rdd, 1.0)  # 1.0 is the smoothing parameter
>>>nb_output = test_rdd.map(lambda point: (nb.predict(point.features), point.label))
>>>print(nb_output.collect())
The nb_output RDD contains the final predictions for the test sample. The great thing to understand is that, in fewer than 50 lines, we built an industry-standard text classification pipeline that can scale to petabytes of training data.
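To turn those (prediction, label) pairs into a single quality number, a common pattern (a sketch, not from the source) is to compute the accuracy on the test RDD:
>>>accuracy = nb_output.filter(lambda pl: pl[0] == pl[1]).count() / float(test_rdd.count())
>>>print(accuracy)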