In the previous sections you learned to use logistic regression for binary classification. In many classification problems, however, more than two classes are of interest. We might wish to predict the genres of songs from samples of audio, or classify images of galaxies by their types. The goal of multi-class classification is to assign an instance to one of a set of classes. scikit-learn uses a strategy called one-vs.-all, or one-vs.-the-rest, to support multi-class classification. One-vs.-all classification trains one binary classifier for each of the possible classes, and the class that is predicted with the greatest confidence is assigned to the instance. LogisticRegression supports multi-class classification using the one-versus-all strategy out of the box. Let's use LogisticRegression for a multi-class classification problem.
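Before turning to the movie reviews, the following minimal sketch illustrates the one-vs.-all behavior using scikit-learn's bundled Iris data set, which is not part of this example; it simply shows that fitting LogisticRegression on a three-class problem produces one set of learned weights per class:
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import LogisticRegression
>>> iris = load_iris()
>>> clf = LogisticRegression()
>>> clf.fit(iris.data, iris.target)
>>> # One binary classifier is trained per class, so coef_ contains one row
>>> # of weights for each of the three species of iris.
>>> print clf.coef_.shape
(3, 4)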
Assume that you would like to watch a movie, but you have a strong aversion to watching bad movies. To inform your decision, you could read reviews of the movies you are considering, but unfortunately you also have a strong aversion to reading movie reviews. Let's use scikit-learn to find the movies with good reviews.
In this example, we will classify the sentiments of phrases taken from movie reviews in the Rotten Tomatoes data set. Each phrase can be classified as one of the following sentiments: negative, somewhat negative, neutral, somewhat positive, or positive. While the classes appear to be ordered, the explanatory variables that we will use do not always corroborate this order due to sarcasm, negation, and other linguistic phenomena. Instead, we will approach this problem as a multi-class classification task.
The data can be downloaded from http://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data. First, let's explore the data set using pandas. Note that the import and data-loading statements in the following snippet are required for the subsequent snippets:
>>> import pandas as pd
>>> df = pd.read_csv('movie-reviews/train.tsv', header=0, delimiter='\t')
>>> print df.count()
PhraseId      156060
SentenceId    156060
Phrase        156060
Sentiment     156060
dtype: int64
The columns of the data set are tab delimited. The data set contains 156,060 instances.
>>> print df.head()
   PhraseId  SentenceId                                             Phrase
0         1           1  A series of escapades demonstrating the adage ...
1         2           1  A series of escapades demonstrating the adage ...
2         3           1                                           A series
3         4           1                                                  A
4         5           1                                             series

   Sentiment
0          1
1          2
2          2
3          2
4          2

[5 rows x 4 columns]
The Sentiment column contains the response variables. The 0 label corresponds to the sentiment negative, 1 corresponds to somewhat negative, and so on. The Phrase column contains the raw text. Each sentence from the movie reviews has been parsed into smaller phrases. We will not require the PhraseId and SentenceId columns in this example. Let's print some of the phrases and examine them:
>>> print df['Phrase'].head(10)
0    A series of escapades demonstrating the adage ...
1    A series of escapades demonstrating the adage ...
2                                             A series
3                                                    A
4                                               series
5    of escapades demonstrating the adage that what...
6                                                   of
7    escapades demonstrating the adage that what is...
8                                            escapades
9    demonstrating the adage that what is good for ...
Name: Phrase, dtype: object
Now let's examine the target classes:
>>> print df['Sentiment'].describe()
count    156060.000000
mean          2.063578
std           0.893832
min           0.000000
25%           2.000000
50%           2.000000
75%           3.000000
max           4.000000
Name: Sentiment, dtype: float64
>>> print df['Sentiment'].value_counts()
2    79582
3    32927
1    27273
4     9206
0     7072
dtype: int64
>>> print df['Sentiment'].value_counts()/df['Sentiment'].count()
2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
dtype: float64
The most common class, Neutral, includes more than 50 percent of the instances. Accuracy will not be an informative performance measure for this problem, as a degenerate classifier that predicts only Neutral can obtain an accuracy near 0.5. Approximately one quarter of the reviews are positive or somewhat positive, and approximately one fifth of the reviews are negative or somewhat negative. Let's train a classifier with scikit-learn:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV


def main():
    # Chain the tf-idf feature extraction and the logistic regression
    # classifier so that both can be tuned together by the grid search.
    pipeline = Pipeline([
        ('vect', TfidfVectorizer(stop_words='english')),
        ('clf', LogisticRegression())
    ])
    parameters = {
        'vect__max_df': (0.25, 0.5),
        'vect__ngram_range': ((1, 1), (1, 2)),
        'vect__use_idf': (True, False),
        'clf__C': (0.1, 1, 10),
    }
    df = pd.read_csv('data/train.tsv', header=0, delimiter='\t')
    X, y = df['Phrase'], df['Sentiment'].as_matrix()
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5)
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=3, verbose=1, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    print 'Best score: %0.3f' % grid_search.best_score_
    print 'Best parameters set:'
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print '    %s: %r' % (param_name, best_parameters[param_name])

if __name__ == '__main__':
    main()
The following is the output of the script:
Fitting 3 folds for each of 24 candidates, totalling 72 fits
[Parallel(n_jobs=3)]: Done   1 jobs       | elapsed:    3.3s
[Parallel(n_jobs=3)]: Done  50 jobs       | elapsed:  1.1min
[Parallel(n_jobs=3)]: Done  68 out of  72 | elapsed:  1.9min remaining:    6.8s
[Parallel(n_jobs=3)]: Done  72 out of  72 | elapsed:  2.1min finished
Best score: 0.620
Best parameters set:
    clf__C: 10
    vect__max_df: 0.25
    vect__ngram_range: (1, 2)
    vect__use_idf: False
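The best cross-validation accuracy of 0.620 is a clear improvement over the degenerate classifier described earlier. As a sanity check, the following sketch estimates that baseline's accuracy with scikit-learn's DummyClassifier; it assumes the df DataFrame from the earlier snippets is still loaded:
>>> import numpy as np
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.metrics import accuracy_score
>>> # The baseline ignores the features, so a placeholder array suffices.
>>> X_dummy = np.zeros((len(df), 1))
>>> baseline = DummyClassifier(strategy='most_frequent')
>>> baseline.fit(X_dummy, df['Sentiment'])
>>> # Prints roughly 0.51, the proportion of the Neutral class.
>>> print accuracy_score(df['Sentiment'], baseline.predict(X_dummy))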
As with binary classification, confusion matrices are useful for visualizing the types of errors made by the classifier. Precision, recall, and F1 score can be computed for each of the classes, and accuracy for all of the predictions can also be calculated. Let's evaluate our classifier's predictions. The following snippet continues the previous example:
predictions = grid_search.predict(X_test)
print 'Accuracy:', accuracy_score(y_test, predictions)
print 'Confusion Matrix:', confusion_matrix(y_test, predictions)
print 'Classification Report:', classification_report(y_test, predictions)
The following will be appended to the output:
Accuracy: 0.636370626682
Confusion Matrix: [[ 1129  1679   634    64     9]
 [  917  6121  6084   505    35]
 [  229  3091 32688  3614   166]
 [   34   408  6734  8068  1299]
 [    5    35   494  2338  1650]]
Classification Report:              precision    recall  f1-score   support

          0       0.49      0.32      0.39      3515
          1       0.54      0.45      0.49     13662
          2       0.70      0.82      0.76     39788
          3       0.55      0.49      0.52     16543
          4       0.52      0.36      0.43      4522

avg / total       0.62      0.64      0.62     78030
First, we make predictions using the best parameter set found by the grid search. While our classifier is an improvement over the baseline classifier, it frequently mistakes Somewhat Positive and Somewhat Negative for Neutral.
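To see where these confusions occur, you can print a few of the misclassified phrases. The following sketch continues the previous snippets; the variable names introduced here are only illustrative:
import numpy as np

# train_test_split may return either a pandas Series or a NumPy array for
# the phrases, so convert to an array before indexing by position.
phrases = np.asarray(X_test)
errors = np.flatnonzero(predictions != y_test)
for i in errors[:5]:
    print 'True: %s, predicted: %s, phrase: %s' % (y_test[i], predictions[i], phrases[i])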