Measuring the classifier's performance

We have to be clear about what we want to measure. The naïve, but easiest, way is to simply calculate the average prediction quality over the test set. This yields a value between 0 (everything predicted wrong) and 1 (perfect prediction).
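To make this concrete, here is a minimal sketch of that average-over-the-test-set idea, using made-up toy labels (the arrays below are hypothetical, not from the book's data set):

```python
import numpy as np

# Hypothetical true labels and predictions for six test samples
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0])

# Accuracy is the fraction of correct predictions: a value between 0 and 1
accuracy = np.mean(y_pred == y_true)
print(accuracy)  # 4 of 6 correct, roughly 0.667
```

This is exactly the quantity that scikit-learn's `score()` method computes for classifiers.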

For now, let's use accuracy as the prediction quality, which scikit-learn conveniently calculates for us with knn.score(). But as we learned in Chapter 2, Classifying with Real-world Examples, we will not do it just once; instead we apply cross-validation using the ready-made KFold class from sklearn.model_selection. Finally, we average the test-set scores of all folds and use the standard deviation to see how much they vary:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold

scores = []
N_FOLDS = 10
cv = KFold(n_splits=N_FOLDS, shuffle=True, random_state=0)

for train, test in cv.split(X, Y):
    X_train, y_train = X[train], Y[train]
    X_test, y_test = X[test], Y[test]
    clf = KNeighborsClassifier()
    clf.fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))

print("Mean(scores)=%.5f Stddev(scores)=%.5f"
      % (np.mean(scores), np.std(scores)))

Here is the output:

Mean(scores)=0.50170 Stddev(scores)=0.01243 

Now, that is far from being usable. With only 50% accuracy, it is like tossing a coin. Apparently, the number of links in a post is not a very good indicator of its quality. So, we can say that this feature does not have much discriminative power—at least not for kNN with k=5.
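One way to confirm that 50% really is chance level is to compare against a classifier that ignores the features entirely. The sketch below uses scikit-learn's DummyClassifier for this; the synthetic X and Y are stand-ins, since the book's actual arrays are not reproduced here:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins for the real data: one feature (like the link
# count) and balanced binary labels
rng = np.random.RandomState(0)
X = rng.rand(200, 1)
Y = rng.randint(0, 2, 200)

# A baseline that always predicts the most frequent training label,
# ignoring X completely
baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X, Y, cv=10)
print("Baseline accuracy: %.5f" % np.mean(scores))
```

If a real feature's cross-validated accuracy is not clearly above such a baseline, it contributes essentially no discriminative power.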
