Working with bigger data – online algorithms and out-of-core learning

If you executed the code examples in the previous section, you may have noticed that it could be computationally quite expensive to construct the feature vectors for the 50,000 movie review dataset during grid search. In many real-world applications, it is not uncommon to work with even larger datasets that can exceed our computer's memory. Since not everyone has access to supercomputer facilities, we will now apply a technique called out-of-core learning, which allows us to work with such large datasets by fitting the classifier incrementally on smaller batches of the dataset.

Back in Chapter 2, Training Simple Machine Learning Algorithms for Classification, we introduced the concept of stochastic gradient descent, which is an optimization algorithm that updates the model's weights using one sample at a time. In this section, we will make use of the partial_fit function of the SGDClassifier in scikit-learn to stream the documents directly from our local drive, and train a logistic regression model using small mini-batches of documents.
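As a brief recap of the notation from Chapter 2, for a model such as Adaline each stochastic gradient descent step updates the weights using only the current training example (here, η is the learning rate, φ(z) the activation, and the superscript (i) indexes a single example):

$\Delta \mathbf{w} = \eta\left(y^{(i)} - \phi\left(z^{(i)}\right)\right)\mathbf{x}^{(i)}, \qquad \mathbf{w} := \mathbf{w} + \Delta\mathbf{w}$

The partial_fit method applies updates of this kind to every mini-batch of documents that we feed it.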

First, we define a tokenizer function that cleans the unprocessed text data from the movie_data.csv file that we constructed at the beginning of this chapter and separates it into word tokens while removing stop words:

>>> import numpy as np
>>> import re
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> def tokenizer(text):
...     text = re.sub('<[^>]*>', '', text)
...     emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)',
...                            text.lower())
...     text = (re.sub('[\W]+', ' ', text.lower())
...             + ' '.join(emoticons).replace('-', ''))
...     tokenized = [w for w in text.split() if w not in stop]
...     return tokenized
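As a quick sanity check, we can apply the tokenizer to a short example string containing HTML markup and emoticons (the exact output depends on the NLTK stop word list installed on your machine):

>>> tokenizer('</a>This :) is :( a test :-)!')
['test', ':)', ':(', ':)']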

Next, we define a generator function stream_docs that reads in and returns one document at a time:

>>> def stream_docs(path):
...     with open(path, 'r', encoding='utf-8') as csv:
...         next(csv) # skip header
...         for line in csv:
...             text, label = line[:-3], int(line[-2])
...             yield text, label
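The slicing inside stream_docs relies on the fact that each line in movie_data.csv ends with a comma, the class label, and a newline character, so line[:-3] strips the trailing label and newline from the review text and line[-2] picks out the label itself. A minimal illustration with a made-up line:

>>> line = '"A charming little film",1\n'
>>> line[:-3], int(line[-2])
('"A charming little film"', 1)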

To verify that our stream_docs function works correctly, let's read in the first document from the movie_data.csv file, which should return a tuple consisting of the review text as well as the corresponding class label:

>>> next(stream_docs(path='movie_data.csv'))
('"In 1974, the teenager Martha Moxley ... ',1)

We will now define a function, get_minibatch, that will take a document stream from the stream_docs function and return a particular number of documents specified by the size parameter:

>>> def get_minibatch(doc_stream, size):
...     docs, y = [], []
...     try:
...         for _ in range(size):
...             text, label = next(doc_stream)
...             docs.append(text)
...             y.append(label)
...     except StopIteration:
...         return None, None
...     return docs, y
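For illustration purposes only, we can fetch a small mini-batch from a fresh document stream (so that the doc_stream generator we will create in a moment remains untouched) and confirm that it contains the requested number of documents and labels:

>>> docs, y = get_minibatch(stream_docs(path='movie_data.csv'), size=3)
>>> len(docs), len(y)
(3, 3)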

Unfortunately, we can't use CountVectorizer for out-of-core learning since it requires holding the complete vocabulary in memory. Also, TfidfVectorizer needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, another useful vectorizer for text processing implemented in scikit-learn is HashingVectorizer. HashingVectorizer is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 function by Austin Appleby (https://sites.google.com/site/murmurhash/):
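To illustrate this data independence with a hypothetical toy example (the vect_demo name and the example document below are made up purely for demonstration), HashingVectorizer can transform documents without being fitted first, and the resulting feature vectors always have exactly n_features columns:

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> vect_demo = HashingVectorizer(n_features=2**4)
>>> vect_demo.transform(['the cat sat on the mat']).shape
(1, 16)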

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> vect = HashingVectorizer(decode_error='ignore',
...                          n_features=2**21,
...                          preprocessor=None,
...                          tokenizer=tokenizer)
>>> clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
>>> doc_stream = stream_docs(path='movie_data.csv')

Note

You can replace SGDClassifier(..., n_iter=1, ...) with SGDClassifier(..., max_iter=1, ...) in scikit-learn versions greater than 0.18. The n_iter parameter is used here deliberately, because scikit-learn 0.18 is still widely used.

Using the preceding code, we initialized HashingVectorizer with our tokenizer function and set the number of features to 2**21. Furthermore, we reinitialized a logistic regression classifier by setting the loss parameter of the SGDClassifier to 'log'. Note that by choosing a large number of features in the HashingVectorizer, we reduce the chance of causing hash collisions, but we also increase the number of coefficients in our logistic regression model.

Now comes the really interesting part. Having set up all the complementary functions, we can now start the out-of-core learning using the following code:

>>> import pyprind
>>> pbar = pyprind.ProgBar(45)
>>> classes = np.array([0, 1])
>>> for _ in range(45):
...     X_train, y_train = get_minibatch(doc_stream, size=1000)
...     if not X_train:
...         break
...     X_train = vect.transform(X_train)
...     clf.partial_fit(X_train, y_train, classes=classes)
...     pbar.update()
0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:39

Again, we made use of the PyPrind package in order to estimate the progress of our learning algorithm. We initialized the progress bar object with 45 iterations and, in the following for loop, we iterated over 45 mini-batches of documents where each mini-batch consists of 1,000 documents. Having completed the incremental learning process, we will use the last 5,000 documents to evaluate the performance of our model:

>>> X_test, y_test = get_minibatch(doc_stream, size=5000)
>>> X_test = vect.transform(X_test)
>>> print('Accuracy: %.3f' % clf.score(X_test, y_test))
Accuracy: 0.878

As we can see, the accuracy of the model is approximately 88 percent, slightly below the accuracy that we achieved in the previous section using the grid search for hyperparameter tuning. However, out-of-core learning is very memory efficient and took less than a minute to complete. Finally, we can use the last 5,000 documents to update our model:

>>> clf = clf.partial_fit(X_test, y_test)

If you are planning to continue directly with Chapter 9, Embedding a Machine Learning Model into a Web Application, I recommend you keep the current Python session open. In the next chapter, we will use the model that we just trained to learn how to save it to disk for later use and embed it into a web application.

Note

A more modern alternative to the bag-of-words model is word2vec, an algorithm that Google released in 2013 (Efficient Estimation of Word Representations in Vector Space, T. Mikolov, K. Chen, G. Corrado, and J. Dean, arXiv preprint arXiv:1301.3781, 2013). The word2vec algorithm is an unsupervised learning algorithm based on neural networks that attempts to automatically learn the relationship between words. The idea behind word2vec is to put words that have similar meanings into similar clusters, and via clever vector-spacing, the model can reproduce certain words using simple vector math, for example, king – man + woman = queen.
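Purely as a hedged sketch of how such a model is typically queried (using the third-party gensim library, version 4 or later, which is not otherwise used in this chapter; a toy corpus such as the one below is far too small for the famous analogy to actually emerge), the word2vec workflow looks roughly as follows:

>>> from gensim.models import Word2Vec
>>> sentences = [['king', 'rules', 'kingdom'],
...              ['queen', 'rules', 'kingdom'],
...              ['man', 'walks'], ['woman', 'walks']]
>>> model = Word2Vec(sentences, vector_size=50, min_count=1, seed=1)
>>> model.wv.most_similar(positive=['king', 'woman'], negative=['man'])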

The original C-implementation with useful links to the relevant papers and alternative implementations can be found at https://code.google.com/p/word2vec/.
