Tuning the classifier's parameters

We certainly have not explored the current setup enough, so some more investigation is warranted. There are roughly two areas where we can play with the knobs: TfidfVectorizer and MultinomialNB. As we have no real intuition about which area we should explore first, let's sweep the hyperparameters.

Let's look at the TfidfVectorizer parameters first:

  • Using different settings for the ngram_range parameter:
    • unigrams (1,1)
    • unigrams and bigrams (1,2)
    • unigrams, bigrams, and trigrams (1,3)
  • Playing with min_df: 1 or 2
  • Exploring the impact of IDF within TF-IDF using use_idf and smooth_idf: False or True
  • Whether to remove stop words or not, by setting stop_words to "english" or None
  • Whether to use the logarithm of the word counts (sublinear_tf)
  • Whether to track word counts or simply track whether words occur or not, by setting binary to True or False

Now let's look at the parameters of the MultinomialNB classifier (a short sketch of setting these knobs directly follows the list):

  • Which smoothing method to use, by setting alpha:
    • Add-one or Laplace smoothing: 1
    • Lidstone smoothing: 0.01, 0.05, 0.1, or 0.5
    • No smoothing: 0
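To make these knobs more concrete, here is a minimal sketch, purely for illustration, of how a few of them map onto the two estimators (the concrete values are arbitrary, not the ones we will end up with):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# A few of the knobs listed above, set explicitly (values chosen arbitrarily)
tfidf = TfidfVectorizer(ngram_range=(1, 2),    # unigrams and bigrams
                        min_df=2,              # ignore terms seen in fewer than 2 documents
                        stop_words="english",  # remove English stop words
                        use_idf=True,          # weight terms by inverse document frequency
                        smooth_idf=True,       # smooth document frequencies to avoid zero divisions
                        sublinear_tf=True,     # use 1 + log(tf) instead of raw term counts
                        binary=False)          # keep counts instead of mere occurrence flags

clf = MultinomialNB(alpha=0.5)                 # Lidstone smoothing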

A simple approach would be to train one classifier per parameter, trying out each of its reasonable values while keeping all other parameters constant, and checking the classifier's results. But as we do not know whether the parameters affect each other, doing it right requires training a classifier for every possible combination of all parameter values. Obviously, this is too tedious to do by hand.

Because this kind of parameter exploration occurs frequently in machine learning tasks, scikit-learn has a dedicated class for it, called GridSearchCV. It takes an estimator (an instance with a classifier-like interface), which in our case will be the Pipeline instance, and a dictionary of parameters with their potential values.

GridSearchCV expects the dictionary's keys to obey a certain format so that it is able to set the parameters of the correct estimator. The format is as follows:

<estimator>__<subestimator>__...__<param_name> 

For example, if we want to specify the desired values to explore for the ngram_range parameter of TfidfVectorizer (named tfidf in the Pipeline description), we would have to say:

param_grid={"tfidf__ngram_range"=[(1, 1), (1, 2), (1, 3)]}  

This will tell GridSearchCV to try out unigrams to trigrams as parameter values for the ngram_range parameter of TfidfVectorizer.
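In case you are ever unsure which keys are valid for a given estimator, get_params() lists all of them. The following sketch assumes a Pipeline with steps named tfidf and clf, matching our setup:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# Every parameter name that GridSearchCV will accept for this estimator,
# including nested ones such as 'tfidf__ngram_range' and 'clf__alpha'
print(sorted(pipeline.get_params().keys()))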

GridSearchCV then trains the estimator with all possible parameter-value combinations. We make sure that it trains on random samples of the training data by using ShuffleSplit, which generates an iterator of random train/test splits. Finally, it provides the best estimator in the member variable best_estimator_.
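To get a feeling for what ShuffleSplit produces, here is a small standalone sketch with ten dummy samples:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X_demo = np.arange(10).reshape(-1, 1)  # ten dummy samples
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

# Each iteration yields the indices of a random 70/30 train/test split
for train_idx, test_idx in cv.split(X_demo):
    print("train:", train_idx, "test:", test_idx)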

As we want to compare the returned best classifier with our current best one, we need to evaluate it in the same way. Therefore, we pass the ShuffleSplit instance via the cv parameter (hence the CV in GridSearchCV).

The last missing piece is to define how GridSearchCV should determine the best estimator. This can be done by providing the desired score function to the scoring parameter, using the make_scorer helper function. We can either write one ourselves or pick one from the sklearn.metrics package. We should certainly not take metrics.accuracy_score, because of our class imbalance (we have far fewer tweets containing sentiments than neutral ones). Instead, we want to have good precision and recall on both classes: tweets with sentiment and tweets without positive or negative opinions. One metric that combines precision and recall is the F-measure (their harmonic mean), which is implemented as metrics.f1_score.
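For illustration, this is how such a scorer is constructed and what the underlying metric computes on a toy example (the labels follow our convention of 1 for tweets carrying sentiment and 0 for the rest):

from sklearn.metrics import make_scorer, f1_score

# make_scorer wraps the metric into a callable with the signature
# scorer(estimator, X, y), which is what GridSearchCV expects
f1_scorer = make_scorer(f1_score)

# The underlying metric: F = 2 * precision * recall / (precision + recall)
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(f1_score(y_true, y_pred))  # ~0.67: precision and recall are both 2/3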

After putting everything together, we get the following code:

from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.metrics import make_scorer, f1_score

def grid_search_model(clf_factory, X, Y):
    # Evaluate each parameter combination on 10 random 70/30 train/test splits
    cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

    param_grid = dict(tfidf__ngram_range=[(1, 1), (1, 2), (1, 3)],
                      tfidf__min_df=[1, 2],
                      tfidf__stop_words=[None, "english"],
                      tfidf__smooth_idf=[False, True],
                      tfidf__use_idf=[False, True],
                      tfidf__sublinear_tf=[False, True],
                      tfidf__binary=[False, True],
                      clf__alpha=[0, 0.01, 0.05, 0.1, 0.5, 1],
                      )

    grid_search = GridSearchCV(clf_factory(),
                               param_grid=param_grid, cv=cv,
                               scoring=make_scorer(f1_score), verbose=10)
    grid_search.fit(X, Y)

    return grid_search.best_estimator_
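If the wait turns out to be too long on your machine, note that GridSearchCV can evaluate the parameter combinations in parallel: passing n_jobs=-1 uses all available CPU cores. This is a general GridSearchCV option, shown here as a variation of the call inside grid_search_model:

grid_search = GridSearchCV(clf_factory(),
                           param_grid=param_grid, cv=cv,
                           scoring=make_scorer(f1_score),
                           n_jobs=-1,  # use all CPU cores
                           verbose=10)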

We have to be patient while executing this:

print("== Pos/neg vs. irrelevant/neutral ==")
X = X_orig
Y = tweak_labels(Y_orig, ["positive", "negative"])
clf = grid_search_model(create_ngram_model, X, Y)
print(clf)  

Since we have just requested a parameter sweep over 1,152 parameter combinations (3 · 2 · 2 · 2 · 2 · 2 · 2 · 6), each being trained on 10 folds, this takes a while:

... waiting some 20 minutes  ...
Pipeline(memory=None,
    steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=True,  
       decode_error='strict',
       dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
       lowercase=True, max_df=1.0, max_features=None, min_df=1,
       ngram_range=(1, 2), norm='l2', preprocessor=None, 
smooth_idf=False, vocabulary=None)), 
('clf', MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True))])  
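Instead of reading the parameter values off the printed Pipeline, we could also ask the fitted GridSearchCV object directly; returning the object itself (or these attributes) from grid_search_model would be a small change:

# After grid_search.fit(X, Y) has finished:
print(grid_search.best_score_)   # mean F1 score of the best parameter combination
print(grid_search.best_params_)  # the winning combination as a plain dictionary,
                                 # e.g. {'tfidf__ngram_range': (1, 2), 'clf__alpha': 0.01, ...}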

To be able to compare the numbers with our previous approach, we will create a best_params dictionary, which we will then pass to the classifier factory, and then run the same code as before that trains on 10-fold CV splits and outputs the mean scores:

best_params = dict(tfidf__ngram_range=(1, 2),
                   tfidf__min_df=1,
                   tfidf__stop_words=None,
                   tfidf__smooth_idf=False,
                   tfidf__use_idf=False,
                   tfidf__sublinear_tf=True,
                   tfidf__binary=False,
                   clf__alpha=0.01,
                   )
print("== Pos/neg vs. irrelevant/neutral ==")
X = X_orig
Y = tweak_labels(Y_orig, ["positive", "negative"])
train_model(lambda: create_ngram_model(best_params), X, Y)  
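Here, create_ngram_model is the factory defined earlier in the chapter. As a purely hypothetical sketch of the idea, assuming the factory builds the same tfidf/clf pipeline as above and simply forwards the parameter dictionary via set_params(), it might look roughly like this (the real definition may differ in its defaults and preprocessing):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def create_ngram_model(params=None):
    # Hypothetical sketch, not the chapter's actual implementation
    pipeline = Pipeline([("tfidf", TfidfVectorizer()),
                         ("clf", MultinomialNB())])
    if params:
        pipeline.set_params(**params)
    return pipeline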

Here are the results:

== Pos/neg vs. irrelevant/neutral ==
Mean acc=0.791    Mean P/R AUC=0.681  

The best estimator indeed improves the P/R AUC from 65.8% to 68.1%, with the settings shown in the previous code.

Also, the devastating results for positive tweets against the rest and for negative tweets against the rest improve if we configure the vectorizer and classifier with the parameters we have just found. Only the positive-versus-negative classification shows slightly inferior performance.

Have a look at the following plots:

Indeed, the P/R curves look much better (note that the plots are taken from the median of the fold classifiers, hence the slightly diverging AUC values). Nevertheless, we probably still wouldn't use those classifiers. Time for something completely different...
