Chapter 9. Embedding a Machine Learning Model into a Web Application

In the previous chapters, you learned about the many different machine learning concepts and algorithms that can help us with better and more efficient decision-making. However, machine learning techniques are not limited to offline applications and analyses; they can also serve as the predictive engine of web services. For example, popular and useful applications of machine learning models in web applications include spam detection in submission forms, search engines, recommendation systems for media or shopping portals, and many more.

In this chapter, you will learn how to embed a machine learning model into a web application that can not only classify, but also learn from data in real time. The topics that we will cover are as follows:

  • Saving the current state of a trained machine learning model
  • Using SQLite databases for data storage
  • Developing a web application using the popular Flask web framework
  • Deploying a machine learning application to a public web server

Serializing fitted scikit-learn estimators

Training a machine learning model can be computationally quite expensive, as we have seen in Chapter 8, Applying Machine Learning to Sentiment Analysis. Surely, we don't want to retrain our model every time we close our Python interpreter and want to make a new prediction or reload our web application. One option for model persistence is Python's built-in pickle module (https://docs.python.org/3.6/library/pickle.html), which allows us to serialize and deserialize Python object structures to compact bytecode so that we can save our classifier in its current state and reload it when we want to classify new samples, without needing the model to learn from the training data all over again. Before you execute the following code, please make sure that you have trained the out-of-core logistic regression model from the last section of Chapter 8, Applying Machine Learning to Sentiment Analysis, and have it ready in your current Python session:

>>> import pickle
>>> import os
>>> dest = os.path.join('movieclassifier', 'pkl_objects')
>>> if not os.path.exists(dest):
...     os.makedirs(dest)

>>> pickle.dump(stop,
...          open(os.path.join(dest, 'stopwords.pkl'),'wb'),
...          protocol=4)
>>> pickle.dump(clf,
...          open(os.path.join(dest, 'classifier.pkl'), 'wb'),
...          protocol=4)

Using the preceding code, we created a movieclassifier directory where we will later store the files and data for our web application. Within this movieclassifier directory, we created a pkl_objects subdirectory to save the serialized Python objects to our local drive. Via the pickle module's dump function, we then serialized the trained logistic regression model as well as the stop word set from the Natural Language Toolkit (NLTK) library, so that we don't have to install the NLTK vocabulary on our server.

The dump function takes as its first argument the object that we want to pickle, and for the second argument we provided an open file object that the Python object will be written to. Via the wb argument inside the open function, we opened the file in binary mode for pickle, and we set protocol=4 to choose the most recent and most efficient pickle protocol available at the time of writing, which was added in Python 3.4 and is therefore only compatible with Python 3.4 or newer. If you have problems using protocol=4, please check whether you are using the latest Python 3 version. Alternatively, you may consider choosing a lower protocol number.
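As a side note, the same serialization step can also be written with with statements, which close the file handles deterministically even if dump raises an exception; the following is a minimal, equivalent sketch that assumes the same Python session as above:

>>> with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
...     pickle.dump(stop, f, protocol=4)
>>> with open(os.path.join(dest, 'classifier.pkl'), 'wb') as f:
...     pickle.dump(clf, f, protocol=4)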

Note

Our logistic regression model contains several NumPy arrays, such as the weight vector, and a more efficient way to serialize NumPy arrays is to use the alternative joblib library. To ensure compatibility with the server environment that we will use in later sections, we will use the standard pickle approach. If you are interested, you can find more information about joblib at http://pythonhosted.org/joblib/.
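For reference, joblib's interface closely mirrors pickle's; the following minimal sketch assumes that the joblib package is installed and that the clf and dest objects from the preceding code example are still in your session (the .joblib filename is just an illustrative convention, not a requirement):

>>> import joblib
>>> joblib.dump(clf, os.path.join(dest, 'classifier.joblib'))
>>> clf = joblib.load(os.path.join(dest, 'classifier.joblib'))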

We don't need to pickle HashingVectorizer, since it does not need to be fitted. Instead, we can create a new Python script file from which we can import the vectorizer into our current Python session. Now, copy the following code and save it as vectorizer.py in the movieclassifier directory:

from sklearn.feature_extraction.text import HashingVectorizer
import re
import os
import pickle

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(
               os.path.join(cur_dir,
               'pkl_objects',
               'stopwords.pkl'), 'rb'))

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text.lower())
    text = re.sub('[\W]+', ' ', text.lower()) \
                  + ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)

After we have pickled the Python objects and created the vectorizer.py file, it is now a good idea to restart our Python interpreter or IPython Notebook kernel to test whether we can deserialize the objects without error.

Note

However, please note that unpickling data from an untrusted source can be a potential security risk, since the pickle module is not secured against malicious code. Since pickle was designed to serialize arbitrary objects, the unpickling process will execute code that has been stored in a pickle file. Thus, if you receive pickle files from an untrusted source (for example, by downloading them from the internet), please proceed with extra care and unpickle the items in a virtual environment and/or on a non-essential machine that does not store important data and to which no one except you has access.
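One pragmatic safeguard, assuming you created the pickle files yourself and merely want to detect later tampering, is to record a checksum at serialization time and verify it before unpickling; the following is a minimal sketch based on Python's built-in hashlib module (the sha256_of helper is a hypothetical name, not part of our application):

import hashlib
import os

def sha256_of(path):
    # stream the file in chunks so that large pickle files
    # don't have to be read into memory all at once
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

# record this value right after pickling, and compare against it
# before every call to pickle.load
print(sha256_of(os.path.join('pkl_objects', 'classifier.pkl')))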

From your Terminal, navigate to the movieclassifier directory, start a new Python session and execute the following code to verify that you can import the vectorizer and unpickle the classifier:

>>> import pickle
>>> import re
>>> import os
>>> from vectorizer import vect
>>> clf = pickle.load(open(
...        os.path.join('pkl_objects',
...                     'classifier.pkl'), 'rb'))

After we have successfully loaded the vectorizer and unpickled the classifier, we can now use these objects to preprocess document samples and make predictions about their sentiment:

>>> import numpy as np
>>> label = {0:'negative', 1:'positive'}

>>> example = ['I love this movie']
>>> X = vect.transform(example)
>>> print('Prediction: %s\nProbability: %.2f%%' %
...       (label[clf.predict(X)[0]],
...        np.max(clf.predict_proba(X))*100))
Prediction: positive
Probability: 91.56%

Since our classifier returns the class labels as integers, we defined a simple Python dictionary to map these integers to their sentiment. We then used HashingVectorizer to transform the simple example document into a word vector X. Finally, we used the predict method of the logistic regression classifier to predict the class label, as well as the predict_proba method to return the corresponding probability of our prediction. Note that the predict_proba method call returns an array with a probability value for each unique class label. Since the class label with the largest probability corresponds to the class label that is returned by the predict call, we used the np.max function to return the probability of the predicted class.
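Since we will need exactly this preprocessing and prediction sequence again in the web application, it can be convenient to wrap it in a small function. The following classify helper is a hypothetical sketch that assumes the vect and clf objects are loaded as shown above:

import numpy as np

label = {0: 'negative', 1: 'positive'}

def classify(document):
    # hash the raw text into a sparse feature vector
    X = vect.transform([document])
    # integer class label predicted by the logistic regression model
    y = clf.predict(X)[0]
    # probability of the predicted class, that is, the largest
    # entry in the predict_proba array
    proba = np.max(clf.predict_proba(X))
    return label[y], proba

Calling classify('I love this movie') would then return the tuple ('positive', 0.9156...), matching the output of the preceding example.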
