Working with bigger data – online algorithms and out-of-core learning

If you executed the code examples in the previous section, you may have noticed that constructing the feature vectors for the 50,000 movie reviews during grid search can be computationally quite expensive. In many real-world applications, it is not uncommon to work with even larger datasets that exceed our computer's memory. Since not everyone has access to supercomputer facilities, we will now apply a technique called out-of-core learning that allows us to work with such large datasets.

Back in Chapter 2, Training Machine Learning Algorithms for Classification, we introduced the concept of stochastic gradient descent, which is an optimization algorithm that updates the model's weights using one sample at a time. In this section, we will make use of the partial_fit method of the SGDClassifier in scikit-learn to stream the documents directly from our local drive and train a logistic regression model using small minibatches of documents.

First, we define a tokenizer function that cleans the unprocessed text data from the movie_data.csv file that we constructed at the beginning of this chapter and separates it into word tokens while removing stop words.

>>> import numpy as np
>>> import re
>>> from nltk.corpus import stopwords
>>> stop = stopwords.words('english')
>>> def tokenizer(text):
...     text = re.sub(r'<[^>]*>', '', text)
...     emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
...                            text.lower())
...     text = (re.sub(r'[\W]+', ' ', text.lower())
...             + ' '.join(emoticons).replace('-', ''))
...     tokenized = [w for w in text.split() if w not in stop]
...     return tokenized
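
To see what the tokenizer produces, the following minimal sketch runs it on a made-up string. It assumes a tiny hand-picked stop-word set in place of the NLTK list so that it works without downloading a corpus; the sample text is hypothetical as well.

```python
import re

# Assumption: a small hand-picked stop-word set stands in for NLTK's
# English list, so the sketch runs without any corpus download.
stop = {'a', 'is', 'this', 'and', 'the', 'it', 'i'}

def tokenizer(text):
    # Strip HTML markup such as <p> or <b>
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) or ;-( before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Replace runs of non-word characters by one space, then append the
    # emoticons with their "nose" character removed
    text = (re.sub(r'[\W]+', ' ', text.lower())
            + ' '.join(emoticons).replace('-', ''))
    return [w for w in text.split() if w not in stop]

tokens = tokenizer('<p>This :) is a <b>test</b> :-)!</p>')
print(tokens)  # ['test', ':)', ':)']
```

The markup and stop words are gone, and both emoticons survive in a normalized form at the end of the token list.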

Next we define a generator function, stream_docs, that reads in and returns one document at a time:

>>> def stream_docs(path):
...     with open(path, 'r', encoding='utf-8') as csv:
...         next(csv)  # skip header
...         for line in csv:
...             # each line ends with ',', the label digit, and '\n'
...             text, label = line[:-3], int(line[-2])
...             yield text, label

To verify that our stream_docs function works correctly, let us read in the first document from the movie_data.csv file, which should return a tuple consisting of the review text as well as the corresponding class label:

>>> next(stream_docs(path='./movie_data.csv'))
('"In 1974, the teenager Martha Moxley ... ',1)
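
The slicing in stream_docs relies on every line ending with a comma, a single-digit label, and a newline. The sketch below illustrates this on a hypothetical two-review file written to a temporary location; the file contents and the variable names rows, tmp_path, and docs are assumptions made for illustration only.

```python
import os
import tempfile

def stream_docs(path):
    with open(path, 'r', encoding='utf-8') as csv:
        next(csv)  # skip header
        for line in csv:
            # line[:-3] drops ',', the label digit, and the newline;
            # line[-2] is the label digit itself
            text, label = line[:-3], int(line[-2])
            yield text, label

# Hypothetical file in the same "review,sentiment" layout
rows = ['review,sentiment\n',
        '"Great movie, loved it",1\n',
        '"Dull and too long",0\n']
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf-8') as f:
    f.writelines(rows)
    tmp_path = f.name

docs = list(stream_docs(tmp_path))
os.remove(tmp_path)
print(docs)
# [('"Great movie, loved it"', 1), ('"Dull and too long"', 0)]
```

Note that the surrounding quotation marks remain part of the review text, and that a final line without a trailing newline would be sliced incorrectly.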

We will now define a function, get_minibatch, that will take a document stream from the stream_docs function and return a particular number of documents specified by the size parameter:

>>> def get_minibatch(doc_stream, size):
...     docs, y = [], []
...     try:
...         for _ in range(size):
...             text, label = next(doc_stream)
...             docs.append(text)
...             y.append(label)
...     except StopIteration:
...         return None, None
...     return docs, y
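
The behavior of get_minibatch at the end of a stream is worth a quick check. The following sketch feeds it a hypothetical in-memory stream of five documents (the names toy_stream, first, second, and third are illustrative assumptions) and requests batches of two.

```python
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        # The stream ran dry before the batch was full
        return None, None
    return docs, y

# Hypothetical in-memory stream of five labeled documents
toy_stream = iter([('doc %d' % i, i % 2) for i in range(5)])

first = get_minibatch(toy_stream, size=2)
second = get_minibatch(toy_stream, size=2)
third = get_minibatch(toy_stream, size=2)  # only one document is left

print(first)   # (['doc 0', 'doc 1'], [0, 1])
print(second)  # (['doc 2', 'doc 3'], [0, 1])
print(third)   # (None, None)
```

As the third call shows, an incomplete final batch is discarded rather than returned, so the batch size should divide the number of documents we intend to consume.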

Unfortunately, we can't use the CountVectorizer for out-of-core learning since it requires holding the complete vocabulary in memory. Also, the TfidfVectorizer needs to keep all the feature vectors of the training dataset in memory to calculate the inverse document frequencies. However, another useful vectorizer for text processing implemented in scikit-learn is HashingVectorizer. HashingVectorizer is data-independent and makes use of the hashing trick via the 32-bit MurmurHash3 algorithm by Austin Appleby (

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> vect = HashingVectorizer(decode_error='ignore', 
...                          n_features=2**21,
...                          preprocessor=None, 
...                          tokenizer=tokenizer)
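>>> # Newer scikit-learn releases name this loss 'log_loss' and no longer
>>> # accept n_iter; adjust the call to match the installed version.
>>> clf = SGDClassifier(loss='log', random_state=1, n_iter=1)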
>>> doc_stream = stream_docs(path='./movie_data.csv')

Using the preceding code, we initialized HashingVectorizer with our tokenizer function and set the number of features to 2**21. Furthermore, we reinitialized a logistic regression classifier by setting the loss parameter of the SGDClassifier to log. Note that, by choosing a large number of features in the HashingVectorizer, we reduce the chance of hash collisions, but we also increase the number of coefficients in our logistic regression model.
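
The trade-off between the number of features and hash collisions can be illustrated with a minimal sketch of the hashing trick. It assumes CRC32 from the standard library as a stand-in for MurmurHash3 and leaves out refinements that scikit-learn applies, such as sign alternation and normalization; the function name hashed_counts and the token list are hypothetical.

```python
import zlib

def hashed_counts(tokens, n_features):
    # Assumption: CRC32 stands in for MurmurHash3. Only the idea
    # "column index = hash(token) modulo n_features" matters here.
    counts = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode('utf-8')) % n_features
        counts[idx] = counts.get(idx, 0) + 1
    return counts

tokens = ['great', 'movie', 'great', 'plot', 'twist', 'ending']

# Many columns: distinct tokens will most likely get distinct indices
wide = hashed_counts(tokens, n_features=2**21)

# Two columns only: five distinct tokens are forced to share indices
narrow = hashed_counts(tokens, n_features=2)

print(len(wide), len(narrow))
```

No vocabulary is stored at any point, which is what makes the approach data-independent. The price is that colliding tokens become indistinguishable to the classifier.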

Now comes the really interesting part. Having set up all the complementary functions, we can now start the out-of-core learning using the following code:

>>> import pyprind
>>> pbar = pyprind.ProgBar(45)
>>> classes = np.array([0, 1])
>>> for _ in range(45):
...     X_train, y_train = get_minibatch(doc_stream, size=1000)
...     if not X_train:
...         break
...     X_train = vect.transform(X_train)
...     clf.partial_fit(X_train, y_train, classes=classes)
...     pbar.update()
0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 50.063 sec

Again, we made use of the PyPrind package in order to estimate the progress of our learning algorithm. We initialized the progress bar object with 45 iterations and, in the following for loop, iterated over 45 minibatches of 1,000 documents each.
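
To make the idea of incremental training more concrete, here is a conceptual sketch of logistic regression updated by stochastic gradient descent one minibatch at a time. It is not scikit-learn's implementation: the class OnlineLogReg, the learning rate, and the synthetic two-feature data stream are all assumptions chosen for illustration.

```python
import math
import random
from itertools import islice

class OnlineLogReg:
    """Toy logistic regression trained by SGD on successive minibatches."""

    def __init__(self, n_features, eta=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.eta = eta  # learning rate (assumed value)

    def _proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One pass over the minibatch; weights from earlier calls are kept
        for x, target in zip(X, y):
            error = target - self._proba(x)
            self.w = [wi + self.eta * error * xi
                      for wi, xi in zip(self.w, x)]
            self.b += self.eta * error
        return self

    def predict(self, x):
        return 1 if self._proba(x) >= 0.5 else 0

def toy_stream(n, seed=1):
    # Hypothetical data: the label is 1 when the first feature is larger
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, int(x[0] > x[1])

stream = toy_stream(1200)
model = OnlineLogReg(n_features=2)
for _ in range(20):
    batch = list(islice(stream, 50))  # next 50 samples from the stream
    X, y = zip(*batch)
    model.partial_fit(X, y)

held_out = list(stream)  # the remaining 200 samples
accuracy = sum(model.predict(x) == t for x, t in held_out) / len(held_out)
print('Accuracy: %.3f' % accuracy)
```

Only one minibatch is held in memory at any time, while the weights carry the accumulated knowledge from call to call. This mirrors the role that partial_fit plays in the loop above.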

Having completed the incremental learning process, we will use the last 5,000 documents to evaluate the performance of our model:

>>> X_test, y_test = get_minibatch(doc_stream, size=5000)
>>> X_test = vect.transform(X_test)
>>> print('Accuracy: %.3f' % clf.score(X_test, y_test))
Accuracy: 0.868

As we can see, the accuracy of the model is 87 percent, slightly below the accuracy that we achieved in the previous section using the grid search for hyperparameter tuning. However, out-of-core learning is very memory-efficient and took less than a minute to complete. Finally, we can use the last 5,000 documents to update our model:

>>> clf = clf.partial_fit(X_test, y_test)

If you are planning to continue directly with Chapter 9, Embedding a Machine Learning Model into a Web Application, I recommend that you keep the current Python session open. In the next chapter, we will use the model that we just trained to learn how to save it to disk for later use and embed it into a web application.


Although the bag-of-words model is still the most commonly used model for text classification, it does not consider sentence structure and grammar. A popular extension of the bag-of-words model is Latent Dirichlet allocation, which is a topic model that considers the latent semantics of words (D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003).

A more modern alternative to the bag-of-words model is word2vec, an algorithm that Google released in 2013 (T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781, 2013). The word2vec algorithm is an unsupervised learning algorithm based on neural networks that attempts to automatically learn the relationship between words. The idea behind word2vec is to put words that have similar meanings into similar clusters; via clever vector-spacing, the model can reproduce certain words using simple vector math, for example, king – man + woman = queen.
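
The vector arithmetic can be illustrated with a toy example. The two-dimensional vectors below are chosen by hand for illustration and are not learned by word2vec; the names vectors, cosine, target, and best are likewise hypothetical.

```python
import math

# Hypothetical hand-made vectors (axis 0: royalty, axis 1: gender).
# Real word2vec embeddings are learned and have many more dimensions.
vectors = {
    'king':  [0.9,  0.8],
    'queen': [0.9, -0.8],
    'man':   [0.1,  0.8],
    'woman': [0.1, -0.8],
    'apple': [-0.7, 0.0],
}

def cosine(u, v):
    # Cosine similarity between two vectors of equal length
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# king - man + woman, computed component by component
target = [k - m + w for k, m, w in zip(vectors['king'],
                                       vectors['man'],
                                       vectors['woman'])]

# Search the remaining words for the vector closest to the target
candidates = [w for w in vectors if w not in ('king', 'man', 'woman')]
best = max(candidates, key=lambda word: cosine(vectors[word], target))
print(best)  # queen
```

With learned embeddings the result is rarely an exact match, which is why the nearest neighbor by cosine similarity is reported rather than an equality.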

The original C-implementation, with useful links to the relevant papers and alternative implementations, can be found at

