The other important use case for big data is machine learning. Especially with Hadoop, scikit-learn matters because it is one of the best options we have for scoring a machine learning model on big data. Large-scale machine learning is currently one of the hottest topics, and doing it in a big data environment such as Hadoop is all the more important. There are two aspects to machine learning models here: building a model on a significantly large amount of data, and scoring a significantly large amount of data with a pre-built model.
To understand this better, let's take the same example data we used in the previous table, where we have some customer comments. We can build, say, a text classification model on a significant training sample, using some of the learnings from Chapter 6, Text Classification, to fit a Naive Bayes, SVM, or logistic regression model. At scoring time, we might then need to score a huge volume of data, such as the entire history of customer comments. Building the model itself on big data, on the other hand, is not possible with scikit-learn; we will require a tool such as Spark or Mahout for that, and it will be covered in the next section. Here, we will take the same step-by-step approach of scoring with a pre-trained model that we took with NLTK. For scoring with a pre-trained model, specifically in a text mining kind of problem, we need two main objects, a vectorizer and a classifier, stored as serialized pickle objects.
Build an offline model using scikit-learn on your local machine, and make sure you pickle the objects. For example, if we use the Naive Bayes example from Chapter 6, Text Classification, we need to store vectorizer and clf as pickle objects:
>>>from sklearn.feature_extraction.text import TfidfVectorizer
>>>from sklearn.naive_bayes import GaussianNB
>>>from sklearn.externals import joblib
>>># in_min_df, in_max_df, X_train, and y_train are prepared as in Chapter 6
>>>vectorizer = TfidfVectorizer(sublinear_tf=True, min_df=in_min_df, stop_words='english', ngram_range=(1,2), max_df=in_max_df)
>>>joblib.dump(vectorizer, "vectorizer.pkl", compress=3)
>>>clf = GaussianNB().fit(X_train, y_train)
>>>joblib.dump(clf, "classifier.pkl")
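Before shipping these pickle objects to the cluster, it is worth sanity-checking them locally. The following is a minimal sketch, assuming the two .pkl files above are in the working directory; the sample comment string is purely hypothetical:

>>>from sklearn.externals import joblib
>>>vectorizer = joblib.load('vectorizer.pkl')
>>>clf = joblib.load('classifier.pkl')
>>># A hypothetical comment to verify that transform + predict round-trips
>>>X = vectorizer.transform(["Really bad experience"]).toarray()
>>>print clf.predict(X), clf.predict_proba(X)[:, 1]

If the predicted label and probability look sensible here, the same objects should behave identically inside the Hive UDF.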
The following are the steps for creating an output table that will hold predictions for all the customer comments in the entire history.
First, create the input table that holds the customer comments, using the following Hive script:

CREATE TABLE $InputTableName (
    ID String,
    Content String
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Next, create the output table that will hold the predictions, again with a Hive script:

CREATE TABLE $OutTableName (
    ID String,
    Content String,
    predict String,
    predict_score double
);
Now, add both pickle objects to the distributed cache using the add FILE command in Hive:

add FILE vectorizer.pkl;
add FILE classifier.pkl;
Next, write the streaming UDF, classification.py, which reads tab-separated rows from stdin, scores each one, and writes the results back to stdout:

import sys
from sklearn.externals import joblib

# Load the pre-trained objects shipped via the distributed cache
clf = joblib.load('classifier.pkl')
vectorizer = joblib.load('vectorizer.pkl')

for line in sys.stdin:
    line = line.strip()
    id, content = line.split('\t')
    # GaussianNB needs a dense array, hence toarray()
    X_test = vectorizer.transform([str(content)]).toarray()
    prob = clf.predict_proba(X_test)
    pred = clf.predict(X_test)
    prob_score = prob[:, 1]
    print '\t'.join([id, content, str(pred[0]), str(prob_score[0])])
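To check the UDF before wiring it into Hive, you can mimic what Hive's TRANSFORM does, that is, pipe a tab-separated row through the script's stdin. The following is a minimal sketch under that assumption; the sample row is hypothetical:

import subprocess

# Hypothetical tab-separated input row, in the same format as the Hive table
sample = "UA0003\tReally bad experience\n"
proc = subprocess.Popen(['python2.7', 'classification.py'],
                        stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = proc.communicate(sample)
# Expect: id, content, predicted label, and probability score, tab-separated
print out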
Once we are done with the classification.py UDF, we also have to add this UDF to the distributed cache and then run it as a TRANSFORM function on each and every row of the table. The Hive script for this will look like the following:
add FILE classification.py;

INSERT OVERWRITE TABLE $OutTableName
SELECT TRANSFORM (id, content)
USING 'python2.7 classification.py'
AS (id string, content string, predict string, predict_score double)
FROM $InputTableName;

Once the job completes, the output table will look like this:
| ID | Content | Predict | Prob_score |
|---|---|---|---|
| UA0001 | "I tried calling you, The service was not up to the mark" | Complaint | 0.98 |
| UA0002 | "Can you please update my phone no" | No | 0.23 |
| UA0003 | "Really bad experience" | Complaint | 0.97 |
| UA0004 | "I am looking for an iPhone" | No | 0.01 |
So, our output table will have all the customer comments for the entire history, a prediction of whether or not each comment was a complaint, and a confidence score. We have chosen a Hive UDF for our example, but a similar process can be achieved through Pig and Python streaming, as we did with NLTK.
This example was meant to give you hands-on experience of scoring a machine learning model on Hive. In the next section, we will talk about how to build a machine learning/NLP model on big data.