Training a bag-of-words classifier

In the previous section, we used simple binary features for the words in the reviews to learn positive and negative sentiment. A better approach is often to use richer features, such as the frequency of the words used in the text. Compared to a binary representation that records only the presence or absence of a word, word counts may better capture the characteristics of a text or document. Bag-of-words is a vector representation of text in which each dimension corresponds to a word in the vocabulary and holds that word's presence or absence, its frequency, or a weighted value. A bag-of-words representation does not capture the order of the words.
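The difference between binary and count features can be seen on a tiny example. The following is a minimal sketch, assuming scikit-learn is installed; the two sample documents and the variable names are made up for illustration and are not part of the book's notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration.
docs = ["good good movie", "bad movie"]

# Count features: each column holds how often a word occurs in the document.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

# Binary features: each column records only whether the word occurs at all.
binary_vec = CountVectorizer(binary=True)
binary = binary_vec.fit_transform(docs).toarray()

# Column order follows the learned vocabulary indices.
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)
print(vocab)
print(counts)
print(binary)

# Word order is lost: a reordered document maps to the same vector.
reordered = count_vec.transform(["movie good good"]).toarray()
print((reordered[0] == counts[0]).all())
```

The repeated word "good" shows up as 2 in the count matrix but only as 1 in the binary matrix, which is the extra information that count features carry.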

The binary feature extraction that was discussed in the previous section is, therefore, a simple bag-of-words representation of text. We will now look at an example of classifying sentiments in tweets using better bag-of-words representations. The complete Jupyter Notebook for this example is available at Chapter02/03_example.ipynb, in the book's code repository. We will use the Twitter sample corpus provided in NLTK. The NLTK Twitter samples contain sentiment polarity, just like in the movie_reviews corpus:

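import random
from nltk.corpus import twitter_samples  # requires nltk.download('twitter_samples')

# Label positive tweets 1 and negative tweets 0
pos_tweets = [(string, 1) for string in twitter_samples.strings('positive_tweets.json')]
neg_tweets = [(string, 0) for string in twitter_samples.strings('negative_tweets.json')]
pos_tweets.extend(neg_tweets)
comb_tweets = pos_tweets
random.shuffle(comb_tweets)  # mix positive and negative examples
tweets, labels = zip(*comb_tweets)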

As before, we read the data from the JSON files and attach the sentiment labels. The JSON parsing and text extraction are done by NLTK, with the strings function. We attach the sentiment label 1 to denote positive sentiment and 0 to denote negative sentiment. We also shuffle the Python list of (tweet, label) tuples so that the positive and negative examples are mixed. Next, we generate the features using the following code:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=10000)
X = count_vectorizer.fit_transform(tweets)

We use the CountVectorizer in sklearn to generate the features, and we limit the number of features to 10000. We also use both unigram and bigram features. An n-gram is a sequence of n contiguous words sampled from the text. A unigram is the usual single-word feature, and a bigram is a sequence of two consecutive words in the text. Because bigrams span two consecutive words, they can capture short word sequences or phrases in the text. In this example, as ngram_range is (1, 2), CountVectorizer will extract both unigram and bigram features from the tweets.
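To see which features an ngram_range of (1, 2) produces, the vectorizer can be fit on a single sentence. This is a minimal sketch, assuming scikit-learn is installed; the sample sentence and the variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fit on one made-up sentence and list the learned features.
ngram_vec = CountVectorizer(ngram_range=(1, 2))
ngram_vec.fit(["this was not good"])

# vocabulary_ maps each extracted unigram or bigram to a column index.
features = sorted(ngram_vec.vocabulary_)
print(features)
```

The bigram "not good" appears as its own feature, so the classifier can treat it differently from the unigram "good".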

We will now split the data into 80% training and 20% test sets and train the model on the training set, using the following code:

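from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Split first, then fit on the training portion only
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=10)
rf = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=10)
rf.fit(X_train, y_train)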
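The workflow can be tried end to end without downloading the Twitter corpus. The following is a rough sketch on a handful of made-up sentences. The texts, labels, and names prefixed with toy_ are assumptions for illustration, and the resulting accuracy says nothing about the tweet data. The point is the order of the steps: vectorize, split, fit on the training rows, then predict on the held-out rows.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up sentences standing in for tweets (1 = positive, 0 = negative).
toy_texts = [
    "what a great day", "love this so much", "really happy today",
    "this is wonderful", "best news ever", "what a terrible day",
    "hate this so much", "really sad today", "this is awful",
    "worst news ever",
]
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# As in the text, the vectorizer is fit on all of the texts before splitting.
toy_X = CountVectorizer(ngram_range=(1, 2)).fit_transform(toy_texts)

# Split first, so the classifier never sees the held-out rows during training.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_labels, test_size=0.2, random_state=10, stratify=toy_labels
)

toy_rf = RandomForestClassifier(n_estimators=10, random_state=10)
toy_rf.fit(toy_X_train, toy_y_train)

toy_preds = toy_rf.predict(toy_X_test)
print(accuracy_score(toy_y_test, toy_preds))
```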

We will now use the model to predict the sentiment labels for the test set, and print the accuracy score and confusion matrix:

from sklearn.metrics import accuracy_score, confusion_matrix

preds = rf.predict(X_test)
print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))

Output

0.758
[[796 173]
 [311 720]]
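The confusion matrix can be unpacked to see where the errors fall. In sklearn's confusion_matrix, rows are true labels and columns are predicted labels, in sorted label order (0, then 1). The sketch below simply redoes the arithmetic on the numbers printed above; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = np.array([[796, 173],
               [311, 720]])

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # share of all test tweets classified correctly
precision = tp / (tp + fp)        # of tweets predicted positive, share truly positive
recall = tp / (tp + fn)           # of truly positive tweets, share that were found

print(round(accuracy, 3), round(precision, 3), round(recall, 3))
```

The model mislabels 311 positive tweets as negative but only 173 negative tweets as positive, so its errors are concentrated on missed positive tweets.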

The model reaches an accuracy of about 76%. Next, we will try the TF-IDF (term frequency-inverse document frequency) vectorizer. TF-IDF is similar to the count-based n-gram model, except that the counts are weighted by how widely each term appears across all of the documents. Terms that occur in many documents get lower weights than terms that are concentrated in a few documents:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=10000)
X = tfidf.fit_transform(tweets)

As before, we extract both unigram and bigram features from the text. After splitting the new features and retraining the classifier, we evaluate the model on the test data:

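# Re-split and retrain on the new features (step omitted in the original listing)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=10)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)
print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))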

Output

0.756
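The weighting idea behind the TF-IDF features can be seen in isolation on a toy corpus. This is a minimal sketch, assuming scikit-learn's default TfidfVectorizer settings (smoothed IDF); the three sentences and the variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up documents: "the" and "was" occur in all of them, "awful" in one.
toy_docs = [
    "the movie was great",
    "the movie was awful",
    "the plot was great",
]

tfidf_demo = TfidfVectorizer()
tfidf_demo.fit(toy_docs)

# Map each word to its learned inverse document frequency weight.
idf = {word: float(tfidf_demo.idf_[idx])
       for word, idx in tfidf_demo.vocabulary_.items()}

for word in ("the", "great", "awful"):
    print(word, round(idf[word], 3))
```

Words that occur in every document, such as "the", receive the lowest weight, while a word confined to a single document, such as "awful", receives the highest.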

TfidfVectorizer, in this case, has not improved the model's accuracy. We will now remove the stop words from the tweets, using the NLTK stop words corpus:

from nltk.corpus import stopwords
tfidf = TfidfVectorizer(ngram_range=(1,2),max_features=10000, stop_words=stopwords.words('english'))
X = tfidf.fit_transform(tweets)

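# Re-split and retrain on the new features (step omitted in the original listing)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=10)
rf.fit(X_train, y_train)
preds = rf.predict(X_test)
print(accuracy_score(y_test, preds))
print(confusion_matrix(y_test, preds))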

Output

0.736

Evaluation on the test data shows a drop in the model's accuracy. Removing stop words does not always improve accuracy; the effect depends on the training data. It is possible that specific stop words occur in common phrases that are good indicators of a tweet's sentiment.
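One way this can happen is with negations. The sketch below is an illustration under stated assumptions: it uses scikit-learn's built-in English stop word list (stop_words="english") instead of the NLTK list, to stay self-contained, and the sample sentences are made up. Both lists treat "not" as a stop word.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Analyzers return the list of features extracted from a single string.
with_stop = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
no_stop = CountVectorizer(ngram_range=(1, 2), stop_words="english").build_analyzer()

# With stop words kept, the bigram "not good" is available as a feature.
print(with_stop("this is not good"))

# With stop words removed, the negated and plain sentences look identical.
print(no_stop("this is not good"))
print(no_stop("this is good"))
```

Once "not" is filtered out, the two sentences yield the same features, so the classifier has no way to tell them apart.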
