Solving an easy problem first

As we saw when we looked at our tweet data, the tweets are not only positive or negative. The majority of tweets actually do not contain any sentiment at all; they are neutral or irrelevant, containing, for instance, raw information (for example, New book: Building Machine Learning ... http://link). This gives us four classes in total. To avoid complicating the task too much, let's focus only on the positive and negative tweets for now:

>>> # first create a Boolean list having True for tweets
>>> # that are either positive or negative
>>> pos_neg_idx = np.logical_or(Y_orig == "positive", Y_orig == "negative")
>>> # now use that index to filter the data and the labels
>>> X = X_orig[pos_neg_idx]
>>> Y = Y_orig[pos_neg_idx]
>>> # finally, convert the labels themselves into Booleans
>>> Y = Y == "positive"

Now, we have the raw tweet text in X and the binary class labels in Y: False (0) for negative and True (1) for positive tweets.
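Because Boolean labels behave like 0 and 1 in NumPy, we can, for instance, count the positive tweets simply by summing Y (a quick sanity check of our own, not part of the original listing):

>>> # True counts as 1, so the sum is the number of positive tweets
>>> print("#total: %i, #positive: %i" % (len(Y), Y.sum()))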

We just said that we will use word-occurrence counts as features. We will not use them in their raw form, though. Instead, we will use TfidfVectorizer to convert the raw tweet text into TF-IDF feature values, which we then use together with the labels to train our first classifier. For convenience, we will use the Pipeline class, which allows us to hook the vectorizer and the classifier together and exposes the combination through the same fit/predict interface as a regular classifier:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def create_ngram_model(params=None):
    tfidf_ngrams = TfidfVectorizer(ngram_range=(1, 3),
                                   analyzer="word", binary=False)
    clf = MultinomialNB()
    pipeline = Pipeline([('tfidf', tfidf_ngrams), ('clf', clf)])
    if params:
        pipeline.set_params(**params)
    return pipeline

The Pipeline instance returned by create_ngram_model() can now be used to fit and predict as if we had a normal classifier. Later, we will pass a dictionary of parameters as params, which will help us to create custom pipelines.
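For instance, a parameter dictionary could tweak both the vectorizer and the classifier at once; Pipeline addresses the parameters of its steps with the <step name>__<parameter> naming convention. The values below are purely illustrative:

    # hypothetical example values; 'tfidf' and 'clf' are the step names
    # we chose when building the Pipeline
    pipeline = create_ngram_model({
        "tfidf__ngram_range": (1, 2),  # restrict to unigrams and bigrams
        "clf__alpha": 0.1,             # smoothing parameter of MultinomialNB
    })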

Since we do not have that much data, we should do cross-validation. This time, however, we will not use KFold, which partitions the data into consecutive folds; instead, we'll use ShuffleSplit. It shuffles the data for us but does not prevent the same data instance from appearing in multiple folds. For each fold, we then keep track of the area under the precision-recall curve (P/R AUC) and the accuracy.
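The following tiny, self-contained illustration (toy data of our own, not from the tweet corpus) shows the difference: with ShuffleSplit, the same test index may reappear across iterations, which KFold would never allow:

    import numpy as np
    from sklearn.model_selection import ShuffleSplit

    toy_X = np.arange(10)
    cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)
    for train, test in cv.split(toy_X):
        print("test indices:", test)  # the same index can occur in several test sets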

To keep our experimentation agile, let's wrap everything together in a train_model() function, which takes a function as a parameter that creates the classifier:

import numpy as np
from sklearn.metrics import precision_recall_curve, auc
from sklearn.model_selection import ShuffleSplit

def train_model(clf_factory, X, Y):
    # setting random_state to get deterministic behavior
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

    scores = []
    pr_scores = []

    for train, test in cv.split(X, Y):
        X_train, y_train = X[train], Y[train]
        X_test, y_test = X[test], Y[test]

        clf = clf_factory()
        clf.fit(X_train, y_train)

        # not used here, but handy when diagnosing overfitting
        train_score = clf.score(X_train, y_train)
        test_score = clf.score(X_test, y_test)
        scores.append(test_score)

        proba = clf.predict_proba(X_test)
        precision, recall, pr_thresholds = precision_recall_curve(
            y_test, proba[:, 1])
        pr_scores.append(auc(recall, precision))

    summary = (np.mean(scores), np.mean(pr_scores))
    print("Mean acc=%.3f\tMean P/R AUC=%.3f" % summary)

Putting everything together, we can train our first model:

>>> X_orig, Y_orig = load_sanders_data()
>>> pos_neg_idx = np.logical_or(Y_orig == "positive", Y_orig == "negative")
>>> X = X_orig[pos_neg_idx]
>>> Y = Y_orig[pos_neg_idx]
>>> Y = Y == "positive"
>>> train_model(create_ngram_model, X, Y)
Mean acc=0.777 Mean P/R AUC=0.885

With our first try using Naïve Bayes on vectorized TF-IDF trigram features, we get an accuracy of 77.7% and an average P/R AUC of 88.5%. Looking at the P/R chart of the median run (the train/test split whose scores are closest to the average), it shows much more encouraging behavior than the plots we saw in the previous chapter. Please note that the AUC of 0.90 shown in the plot is slightly different from the mean P/R AUC of 0.885, because the plot is taken from the median training run, whereas the mean P/R AUC averages over the AUC scores of all runs. The same principle applies to the subsequent images.
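The chart itself is not reproduced here, but the following sketch shows one way such a plot could be produced. It assumes that train_model() was extended to also collect each fold's precision and recall arrays in a list called pr_curves, parallel to pr_scores; both the helper name and that extension are our own, not part of the original listing:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_pr_median(pr_scores, pr_curves):
        # pick the run whose P/R AUC is the median over all runs
        median_idx = np.argsort(pr_scores)[len(pr_scores) // 2]
        precision, recall = pr_curves[median_idx]
        plt.plot(recall, precision, lw=2)
        plt.fill_between(recall, precision, alpha=0.3)
        plt.xlabel("Recall")
        plt.ylabel("Precision")
        plt.title("P/R curve of the median run (AUC=%.2f)"
                  % pr_scores[median_idx])
        plt.show()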

For a start, the results are quite encouraging. They get even more impressive when we realize that 100% accuracy is probably never achievable in a sentiment-classification task: for some tweets, even human annotators do not agree on the same label.
