Training a sentiment classifier for movie reviews

We will now look at classifying the sentiment of reviews in NLTK's movie reviews corpus. The complete Jupyter Notebook for this example is available at Chapter02/01_example.ipynb in the book's code repository.

First, we will load the movie reviews based on the sentiment categories, which are either positive or negative, using the following code:

import random
import nltk
from nltk.corpus import movie_reviews

cats = movie_reviews.categories()
reviews = []
for cat in cats:
    for fid in movie_reviews.fileids(cat):
        review = (list(movie_reviews.words(fid)), cat)
        reviews.append(review)
random.shuffle(reviews)

The categories() function returns the labels pos and neg, for positive and negative sentiment, respectively. There are 1,000 reviews in each of the positive and negative categories. We use Python's random.shuffle() function to put the grouped positive and negative reviews into random order. Next, we will select the top words in the reviews and use them as a base vocabulary for feature engineering, or extraction, using the following code:

all_wd_in_reviews = nltk.FreqDist(wd.lower() for wd in movie_reviews.words())
top_wd_in_reviews = [list(wds) for wds in zip(*all_wd_in_reviews.most_common(2000))][0]

We have selected the top 2,000 words in the reviews to generate the features. We generate binary features from these words, based on their presence or absence in a review. Therefore, each training instance will have 2,000 features: 1 at the positions where the corresponding word is present in the review, and 0 otherwise:

def ext_ft(review, top_words):
    review_wds = set(review)
    ft = {}
    for wd in top_words:
        ft['word_present({})'.format(wd)] = (wd in review_wds)
    return ft
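
To see what this function produces, here is a toy call with a made-up review and a three-word vocabulary (both are illustrative only and are not part of the notebook):

sample_review = ['an', 'outstanding', 'film']
sample_vocab = ['outstanding', 'awful', 'boring']
print(ext_ft(sample_review, sample_vocab))
# {'word_present(outstanding)': True, 'word_present(awful)': False, 'word_present(boring)': False}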

Each movie review is passed to the ext_ft() function, which returns its binary features as a dictionary. We apply it to every review in the shuffled list, as shown in the following code:

featuresets = [(ext_ft(d,top_wd_in_reviews), c) for (d,c) in reviews]
train_set, test_set = featuresets[200:], featuresets[:200]

We have also split the labeled data into a training set and a test set, holding out the first 200 shuffled reviews (10% of the data) for testing. As an initial test, we will use the simple Naive Bayes classifier that comes with NLTK, as shown in the following code:

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))


Output
0.805

Even with the simple Naive Bayes classifier, we can achieve about 80% accuracy on this dataset. We can also look at the most informative features that were learned by the classification model. Here, we will view the top 10 most informative features, using the following code:

classifier.show_most_informative_features(10)

The show_most_informative_features() function prints the most relevant features (the number of features to show is passed as an argument), as shown in the following output:

Output:

Most Informative Features
 word_present(seagal) = True              neg : pos    =     12.9 : 1.0
 word_present(outstanding) = True         pos : neg    =     10.2 : 1.0
 word_present(mulan) = True               pos : neg    =      7.0 : 1.0
 word_present(wonderfully) = True         pos : neg    =      6.5 : 1.0
 word_present(damon) = True               pos : neg    =      5.7 : 1.0
 word_present(ridiculous) = True          neg : pos    =      5.6 : 1.0
 word_present(awful) = True               neg : pos    =      5.6 : 1.0
 word_present(lame) = True                neg : pos    =      5.5 : 1.0
 word_present(era) = True                 pos : neg    =      5.4 : 1.0
 word_present(waste) = True               neg : pos    =      5.3 : 1.0

It looks like some of the words learned by the model, such as waste, awful, and ridiculous, convey negative connotations, while words such as outstanding, wonderfully, and era are associated with positive sentiment.

We will now evaluate sklearn's random forest classifier on the movie reviews data. Before that, we will vectorize the features using DictVectorizer, as we did in the previous section when training a POS tagger, as shown in the following code:

from sklearn.feature_extraction import DictVectorizer

d_vect = None
def get_train_test(tr_set, te_set):
    global d_vect
    d_vect = DictVectorizer(sparse=False)
    X_tr, y_tr = zip(*tr_set)
    X_tr = d_vect.fit_transform(X_tr)
    X_te, y_te = zip(*te_set)
    X_te = d_vect.transform(X_te)
    return X_tr, X_te, y_tr, y_te
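
As a standalone illustration of what DictVectorizer does with such dictionaries (the toy data below is made up and is not part of the notebook), each distinct key becomes a column, and the boolean values become 0s and 1s:

from sklearn.feature_extraction import DictVectorizer

# Toy feature dictionaries, in the same shape as those produced by ext_ft()
toy_train = [{'word_present(good)': True, 'word_present(bad)': False},
             {'word_present(good)': False, 'word_present(bad)': True}]
toy_test = [{'word_present(good)': True, 'word_present(bad)': True}]

toy_vect = DictVectorizer(sparse=False)
print(toy_vect.fit_transform(toy_train))  # [[0. 1.], [1. 0.]] -- columns in sorted key order
print(toy_vect.get_feature_names())       # ['word_present(bad)', 'word_present(good)']
print(toy_vect.transform(toy_test))       # [[1. 1.]]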

In the preceding function, tr_set and te_set are the training and test set instances that we obtained previously. The get_train_test() function returns the vectorized features, which can be passed to sklearn's random forest classifier, as shown in the following code:

from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = get_train_test(train_set, test_set)
rf = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=10)
rf.fit(X_train, y_train)

Here, we used 100 estimators, or decision trees, for the classifier. The n_jobs parameter sets the number of parallel jobs, for faster training and prediction:

from sklearn.metrics import accuracy_score

preds = rf.predict(X_test)
print(accuracy_score(y_test, preds))

Output
0.81

The accuracy (of around 81%) is a slight improvement over the Naive Bayes classifier. We will now remove all of the stop words from the reviews and train the classifier again, to see whether this improves the model's accuracy. We use the NLTK stop words corpus to remove the stop words. Just like earlier, we will select the top 2,000 words, as shown in the following code:

from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
all_words_in_reviews = nltk.FreqDist(word.lower() for word in movie_reviews.words() if word not in stopwords_list)
top_words_in_reviews = [list(words) for words in zip(*all_words_in_reviews.most_common(2000))][0]

top_words_in_reviews now excludes the stop words. Again, we will generate the features using this reduced vocabulary and train a random forest classifier.
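
The notebook repeats the earlier feature-extraction and training steps with the new vocabulary; a minimal sketch, reusing the ext_ft() and get_train_test() helpers defined above (the exact code in the notebook may differ slightly), looks like this:

# Rebuild the binary features with the stop-word-free vocabulary
featuresets = [(ext_ft(d, top_words_in_reviews), c) for (d, c) in reviews]
train_set, test_set = featuresets[200:], featuresets[:200]

# Vectorize and retrain the random forest with the same settings as before
X_train, X_test, y_train, y_test = get_train_test(train_set, test_set)
rf = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=10)
rf.fit(X_train, y_train)

We then predict on the held-out test set and check the accuracy again: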

preds = rf.predict(X_test)
print(accuracy_score(y_test, preds))

Output
0.76

The stop word removal for this dataset has not improved the model's accuracy; in fact, it has reduced it. We can look at the most informative features, as we did for the Naive Bayes classifier, by using the following code:

features_list = zip(d_vect.get_feature_names(), rf.feature_importances_)
features_list = sorted(features_list, key=lambda x: x[1], reverse=True)
print(features_list[0:20])

Just like before, we have sorted the features based on their importance, as learned by the random forest classifier. We will print the top 20 entries from the sorted Python list, features_list:

[('word_present(bad)', 0.012904816953952729), ('word_present(boring)', 0.006797056379259946), ('word_present(stupid)', 0.006742453545126172), ('word_present(awful)', 0.00605732124427093), ('word_present(worst)', 0.005618499631730539), ('word_present(waste)', 0.005091242651240423), ('word_present(supposed)', 0.005019844359438753), ('word_present(excellent)', 0.005002846831984908), ('word_present(mess)', 0.004735341799753426), ('word_present(wasted)', 0.004477280752464545), ('word_present(ridiculous)', 0.00435578373608493), ('word_present(lame)', 0.00404257877140679), ('word_present(also)', 0.003663095965733155), ('word_present(others)', 0.0035194019538410553), ('word_present(dull)', 0.003464806019875671), ('word_present(plot)', 0.0034406946286116035), ('word_present(nothing)', 0.0033285487918061265), ('word_present(performances)', 0.003286015291474251), ('word_present(outstanding)', 0.0032708132090801516), ('word_present(memorable)', 0.003265718932501386)]

Similar to the Naive Bayes classifier, we can find words that convey positive and negative sentiments. While binary features might be useful for rudimentary text classification tasks, they are not suitable for more complex text classification applications. We will look at better feature extraction techniques in the next section.
