Once we have the entire corpus in the form of lists, we need to perform some form of sampling. Typically, the corpus is split into a development training set, a dev-test set, and a test set, similar to the sampling shown in the following figure.
The idea behind the whole exercise is to avoid overfitting. If we feed all the data points to the model, then the algorithm will learn from the entire corpus, but the real test of these algorithms is how they perform on unseen data. In very simplistic terms, if we use the entire dataset in the model learning process, the classifier will perform very well on this data, but it will not be robust. The reason is that we have tuned it to perform best on the given data, so it never learns how to deal with unknown data.
To solve this kind of problem, the best way is to divide the entire corpus into two major sets: the development set and the test set. The test set is kept away from the modeling exercise. We use only the dev set to build and tune the model. Once we are done with the entire modeling exercise, the results are projected based on the test set that we put aside. If the model performs well on this set, we can be reasonably confident that it is accurate and robust for new data samples.
Sampling itself is a very complicated and well-researched stream in the machine learning community, and it's a remedy for many data skewness and overfitting issues. For simplicity, we will use basic sampling, where we just divide the corpus into a split of 70:30:
>>>import numpy as np
>>># I chose this threshold for a 70:30 train and test split.
>>>trainset_size = int(round(len(sms_data) * 0.70))
>>>print('The training set size for this classifier is ' + str(trainset_size))
>>>x_train = np.array([''.join(el) for el in sms_data[:trainset_size]])
>>>y_train = np.array([el for el in sms_labels[:trainset_size]])
>>># the test set starts at trainset_size, so that no sample is skipped
>>>x_test = np.array([''.join(el) for el in sms_data[trainset_size:]])
>>>y_test = np.array([el for el in sms_labels[trainset_size:]])
>>>print(x_train)
>>>print(y_train)
To understand more about the available sampling techniques, go through http://scikit-learn.org/stable/modules/classes.html#module-sklearn.cross_validation (in recent scikit-learn releases, these utilities live in the sklearn.model_selection module).
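The slice-based split shown earlier takes the first 70 percent of the messages as they appear in the corpus, which only works well if the corpus is not ordered by label. The following is a minimal sketch of a shuffled split that uses only the standard library. The function name `split_corpus`, its parameters, and the toy data are hypothetical and are not part of the original code:

```python
import random


def split_corpus(data, labels, train_fraction=0.70, seed=42):
    """Shuffle (text, label) pairs together, then cut them into train and test parts."""
    pairs = list(zip(data, labels))
    # A fixed seed makes the shuffle, and therefore the split, reproducible.
    random.Random(seed).shuffle(pairs)
    cut = int(round(len(pairs) * train_fraction))
    train, test = pairs[:cut], pairs[cut:]
    x_train = [text for text, _ in train]
    y_train = [label for _, label in train]
    x_test = [text for text, _ in test]
    y_test = [label for _, label in test]
    return x_train, y_train, x_test, y_test


# Toy stand-ins for sms_data and sms_labels.
toy_data = ["msg %d" % i for i in range(10)]
toy_labels = ["spam" if i % 5 == 0 else "ham" for i in range(10)]

x_train, y_train, x_test, y_test = split_corpus(toy_data, toy_labels)
print(len(x_train), len(x_test))
```

Shuffling the pairs together keeps every message attached to its own label, and slicing at `cut` on both sides means no sample is dropped between the two sets.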
Let's jump to one of the most important things, where we transform the entire text into a vector form. The form is referred to as the term-document matrix. If we have to create a term-document matrix for the given example, it will look somewhat like this:
| TDM | anymore | call | camera | color | cried | enough | entitled | free | gon | had | latest | mobile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SMS1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 1 | 0 | 3 |
| SMS2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
This representation of a text document is also known as the bag-of-words (BOW) representation. It is one of the most commonly used representations in text mining and other applications. Essentially, we do not consider the order of the words or any context between them when generating this kind of representation.
To generate a similar term-document matrix in Python, we use scikit vectorizers:
>>>from sklearn.feature_extraction.text import CountVectorizer
>>>sms_exp = []
>>>for line in sms_list:
>>>    sms_exp.append(preprocessing(line[1]))
>>>vectorizer = CountVectorizer(min_df=1)
>>>X_exp = vectorizer.fit_transform(sms_exp)
>>># on scikit-learn 1.0 and later, use get_feature_names_out() instead
>>>print("||".join(vectorizer.get_feature_names()))
>>>print(X_exp.toarray())
array([[ 1, 0, 1, 1, 1, 0, 0, 1, 2, 0, 1, 0, 1, 3, 1, 0, 0, 0, 1, 0, 0, 2, 0, 0],
       [ 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]])
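To see what a count vectorizer does conceptually, here is a small sketch that builds a term-document matrix by hand with the standard library. It assumes plain whitespace tokenization, which is simpler than the tokenizer scikit-learn uses, and the function name `term_document_matrix` is hypothetical:

```python
from collections import Counter


def term_document_matrix(documents):
    """Return the sorted vocabulary and one row of term counts per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted(set(token for tokens in tokenized for token in tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for terms that do not occur in the document.
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


docs = ["free mobile free camera", "call mobile"]
vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)  # ['call', 'camera', 'free', 'mobile']
print(matrix)      # [[0, 1, 2, 1], [1, 0, 0, 1]]
```

Each row is one document and each column is one term, so word order is lost entirely, which is exactly the bag-of-words idea.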
The count vectorizer is a good start, but there is an issue that you will face while using it: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics. A common remedy is to divide the count of each word in a document by the total number of words in that document; these normalized counts are called term frequencies (tf).
Another refinement on top of tf is to downscale the weights of words that occur in many documents in the corpus, and are therefore less informative than those that occur only in a smaller portion of the corpus.
This downscaling is called tf–idf (term frequency–inverse document frequency). Fortunately, scikit-learn also provides a way to achieve this:
>>>from sklearn.feature_extraction.text import TfidfVectorizer
>>>vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english', strip_accents='unicode', norm='l2')
>>># learn the vocabulary and the idf weights from the training set only
>>>X_train = vectorizer.fit_transform(x_train)
>>># reuse the fitted vocabulary and weights for the test set
>>>X_test = vectorizer.transform(x_test)
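The weighting itself is easy to sketch by hand. The following example uses the textbook formula, tf multiplied by log(N / df), with whitespace tokenization. This is an assumption made for illustration: TfidfVectorizer applies a smoothed idf and, with norm='l2', also normalizes each row, so its numbers will differ. The function name `tfidf_weights` is hypothetical:

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    document_frequency = Counter(
        term for tokens in tokenized for term in set(tokens)
    )
    weighted_docs = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weighted_docs.append({
            term: (count / len(tokens)) * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return weighted_docs


docs = ["call me now", "call free prize now", "call now"]
weights = tfidf_weights(docs)
print(weights[0]["call"])            # 0.0, because "call" occurs in every document
print(round(weights[1]["free"], 4))  # 0.2747
```

A term such as "call" that appears in every document gets a weight of zero, while a term such as "free" that appears in only one document keeps a positive weight.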
We now have the text in a matrix format, the same as we have in any machine learning exercise. Now, X_train and X_test can be used for classification using any machine learning algorithm. Let's talk about some of the most commonly used machine learning algorithms in the context of text classification.
Let's build your first text classifier. Let's start with a Naive Bayes classifier. Naive Bayes relies on Bayes' theorem and, essentially, is a model that assigns a class label to a sample based on the conditional probability of the class given the features/attributes. Here, we use term frequencies (the multinomial variant) or term presence/absence (the Bernoulli variant) to estimate the prior and posterior probabilities.
The naive assumption here is that all features are independent of each other, which looks counterintuitive in the case of text. However, surprisingly, Naive Bayes performs quite well in most real-world use cases.
Another great thing about NB is that it's very simple and very easy to implement and score. We only need to store the frequencies and calculate the probabilities. It's really fast in training as well as in testing (scoring). For all these reasons, in most cases of text classification, it serves as a baseline.
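To make "store the frequencies and calculate the probabilities" concrete, here is a minimal sketch of a multinomial Naive Bayes classifier in plain Python. It assumes whitespace tokenization and add-one (Laplace) smoothing. The function names and the toy messages are hypothetical, and this is an illustration rather than how scikit-learn implements it:

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Store the class frequencies and the per-class word frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for doc, label in zip(documents, labels):
        tokens = doc.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict_naive_bayes(model, document):
    """Return the class with the highest log posterior (up to a constant)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in document.lower().split():
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids log(0) for words unseen in this class.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


train_docs = [
    "free prize claim now",
    "free ringtone reply now",
    "see you at home",
    "call me at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_naive_bayes(train_docs, train_labels)
print(predict_naive_bayes(model, "claim your free prize"))
print(predict_naive_bayes(model, "see you at home"))
```

Training is a single pass that only counts, and scoring is a sum of logarithms, which is why the method is fast on both sides.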
Let's write some code to achieve this classifier:
>>>from sklearn.naive_bayes import MultinomialNB
>>>from sklearn.metrics import confusion_matrix, classification_report
>>>clf = MultinomialNB().fit(X_train, y_train)
>>>y_nb_predicted = clf.predict(X_test)
>>>print(y_nb_predicted)
>>>print(' confusion_matrix ')
>>>cm = confusion_matrix(y_test, y_nb_predicted)
>>>print(cm)
>>>print(' Here is the classification report:')
>>>print(classification_report(y_test, y_nb_predicted))
confusion_matrix
[[1205 5]
[26 156]]
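The way to read the confusion matrix is that each row is an actual class and each column is a predicted class, in the order ham, spam. If we treat ham as the positive class, then of the 1,392 samples in the test set, there were 1,205 true positives (ham classified as ham) and 156 true negatives (spam classified as spam). However, we also predicted 5 false negatives (ham flagged as spam) and 26 false positives (spam passed as ham). There are different ways of measuring a typical binary classification.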
Some of the most common measures used in classification are accuracy, precision, recall, and the f1 measure; each is defined in terms of the counts of true and false positives and negatives.
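As a quick reference, the standard definitions of these measures can be sketched in a few lines of Python. The function name `classification_metrics` and the counts passed to it are hypothetical and are not taken from the classifier results in this section:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common binary classification measures from confusion-matrix counts.

    Assumes that every denominator is nonzero.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total    # fraction of all samples classified correctly
    precision = tp / (tp + fp)      # fraction of predicted positives that are correct
    recall = tp / (tp + fn)         # fraction of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, used only to exercise the formulas.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=130)
for name, value in sorted(metrics.items()):
    print("%-9s %.2f" % (name, value))
```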
Here is the classification report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| ham | 0.97 | 1.00 | 0.98 | 1210 |
| spam | 1.00 | 0.77 | 0.87 | 182 |
| avg / total | 0.97 | 0.97 | 0.97 | 1392 |
With the preceding definitions, we can now understand the results clearly. Most of the preceding metrics look good, which suggests that our classifier is performing accurately, although the spam recall of 0.77 shows that some spam still slips through. I would highly recommend that you look into the metrics module for more options to analyze the results of the classifier. The most important and balanced metric is the f1 measure (which is nothing but the harmonic mean of precision and recall), which is used widely because it gives a better picture of the coverage and the quality of the classification algorithms. Accuracy tells us what fraction of all the samples were classified correctly. Precision and recall both have significance: precision tells us what fraction of the samples predicted as positive are true positives, while recall tells us what fraction of the actual positives (the true positives plus the false negatives) we managed to find.
For more information on the various scikit-learn metrics classes, visit http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
Another important step in understanding our model is to look deep into it by examining the actual features that contribute to the positive and negative classes. I wrote a very small snippet to generate the top n features and print them. Let's have a look at them:
>>>feature_names = vectorizer.get_feature_names()
>>># newer scikit-learn versions no longer provide coef_ for MultinomialNB; feature_log_prob_ holds the per-class values there
>>>coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
>>>n = 10
>>># pair the n lowest coefficients with the n highest ones
>>>top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
>>>for (coef_1, fn_1), (coef_2, fn_2) in top:
>>>    print(' %.4f %-15s %.4f %-15s' % (coef_1, fn_1, coef_2, fn_2))
-9.1602 10 den          -6.0396 free
-9.1602 15              -6.3487 txt
-9.1602 1hr             -6.5067 text
-9.1602 1st ur          -6.5393 claim
-9.1602 2go             -6.5681 reply
-9.1602 2marrow         -6.5808 mobile
-9.1602 2morrow         -6.5858 stop
-9.1602 2mrw            -6.6124 ur
-9.1602 2nd innings     -6.6245 prize
-9.1602 2nd ur          -6.7856 www
In the preceding code, I just read all the feature names from the vectorizer, got the coefficients related to each feature, and then printed the top 10 features at each end of the sorted list. If you want more features, just modify the value of n. If we look closely at the features, we get a lot of information about the model as well as more suggestions about our feature selection and other parameters, such as preprocessing, unigrams/bigrams, stemming, tokenization, and so on. For example, if you look at the top features of ham, you can see that 2morrow, 2nd innings, and several terms that start with digits come out as very significant. On the positive class (spam), we can see that the term "free" comes out as very significant, which is intuitive because many spam messages are about free offers and deals. Some of the other terms to note are prize, www, and claim.
For more details, refer to http://scikit-learn.org/stable/modules/naive_bayes.html.
Decision trees are one of the oldest predictive modeling techniques, where for the given features and target, the algorithm tries to build a logic tree. There are multiple algorithms that exist for decision trees. One of the most famous and widely used algorithms is CART.
CART constructs binary trees using the feature and the threshold that yield the largest information gain at each node. Let's write the code to get a CART classifier:
>>>from sklearn import tree
>>>from sklearn.metrics import classification_report
>>># convert the sparse matrices to dense NumPy arrays for the tree module
>>>clf = tree.DecisionTreeClassifier().fit(X_train.toarray(), y_train)
>>>y_tree_predicted = clf.predict(X_test.toarray())
>>>print(y_tree_predicted)
>>>print(' Here is the classification report:')
>>>print(classification_report(y_test, y_tree_predicted))
The only difference is in the input format of the training set. We convert the sparse matrix to a dense NumPy array because the scikit-learn tree module used here takes only a NumPy array (newer releases also accept sparse input).
Generally, trees are good when the number of features is small. So, although our results look good here, people hardly use trees in text classification. On the other hand, trees have some really positive sides to them. The decision tree is still one of the most intuitive algorithms and is very easy to explain and implement. There are many implementations of tree-based algorithms, such as ID3, C4.5, and C5.0. scikit-learn uses an optimized version of the CART algorithm.
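The choice of feature and threshold at a single node can be sketched for one feature as follows. This example assumes Gini impurity as the split criterion (entropy-based information gain would be searched in the same way), and the function names and the toy counts of the word "free" are hypothetical:

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    impurity = 1.0
    for cls in set(labels):
        p = labels.count(cls) / n
        impurity -= p * p
    return impurity


def best_split(values, labels):
    """Try midpoints between distinct values and return (threshold, weighted impurity)."""
    best_threshold, best_impurity = None, float("inf")
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        # Weight each child's impurity by the share of samples it receives.
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best_impurity:
            best_threshold, best_impurity = threshold, weighted
    return best_threshold, best_impurity


# How often the word "free" occurs in eight toy messages, with their labels.
free_counts = [0, 0, 1, 0, 2, 3, 2, 0]
labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"]
threshold, impurity = best_split(free_counts, labels)
print(threshold, impurity)  # 1.5 0.0
```

A full tree repeats this search over every feature at every node, which hints at why trees become expensive when the vocabulary contains thousands of terms.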
Stochastic gradient descent (SGD) is a simple, yet very efficient approach that fits linear models. It is particularly useful when the number of samples (and the number of features) is very large. If you follow the cheat sheet, you will find SGD to be the one-stop solution for many text classification problems. Since it also takes care of regularization and provides different losses, it turns out to be a great choice when experimenting with linear models.
The SGD module in scikit-learn provides functionality to fit linear models for classification and regression using different (convex) loss functions and penalties. For example, with loss='log' (named 'log_loss' in recent releases), it fits a logistic regression model, which is also known as maximum entropy (MaxEnt), while with loss='hinge', it fits a linear support vector machine (SVM).
An example of SGD is as follows:
>>>from sklearn.linear_model import SGDClassifier
>>>from sklearn.metrics import confusion_matrix, classification_report
>>># n_iter is the number of passes over the training data; newer scikit-learn versions call this parameter max_iter
>>>clf = SGDClassifier(alpha=.0001, n_iter=50).fit(X_train, y_train)
>>>y_pred = clf.predict(X_test)
>>>print(' Here is the classification report:')
>>>print(classification_report(y_test, y_pred))
>>>print(' confusion_matrix ')
>>>cm = confusion_matrix(y_test, y_pred)
>>>print(cm)
Here is the classification report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| ham | 0.99 | 1.00 | 0.99 | 1210 |
| spam | 0.96 | 0.91 | 0.93 | 182 |
| avg / total | 0.98 | 0.98 | 0.98 | 1392 |
Most informative features:
-1.0002 sir             2.3815 ringtoneking
-0.5239 bed             2.0481 filthy
-0.4763 said            1.8576 service
-0.4763 happy           1.7623 story
-0.4763 might           1.6671 txt
-0.4287 added           1.5242 new
-0.4287 list            1.4765 ringtone
-0.4287 morning         1.3813 reply
-0.4287 always          1.3337 message
-0.4287 and             1.2860 call
-0.4287 plz             1.2384 chat
-0.3810 people          1.1908 text
-0.3810 actually        1.1908 real
-0.3810 urgnt           1.1431 video
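To show the kind of update that stochastic gradient descent performs, here is a minimal plain-Python sketch that fits a linear model with the log loss and an L2 penalty, one sample at a time. It assumes labels encoded as -1 and +1 and a constant learning rate, which is simpler than the schedule SGDClassifier uses. The function name `sgd_logistic` and the toy features are hypothetical:

```python
import math
import random


def sgd_logistic(samples, labels, alpha=0.0001, learning_rate=0.1, epochs=50, seed=0):
    """Fit weights and bias by SGD on log loss plus (alpha / 2) * ||w||^2."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the samples in a new random order on each pass
        for i in order:
            x, y = samples[i], labels[i]
            margin = y * (sum(w * v for w, v in zip(weights, x)) + bias)
            # Derivative of log(1 + exp(-margin)) with respect to the raw score.
            scale = -y / (1.0 + math.exp(margin))
            weights = [
                w - learning_rate * (scale * v + alpha * w)
                for w, v in zip(weights, x)
            ]
            bias -= learning_rate * scale
    return weights, bias


def predict(weights, bias, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else -1


# Toy features: [count of "free", count of "home"]; +1 is spam and -1 is ham.
samples = [[2, 0], [1, 0], [3, 0], [0, 1], [0, 2], [0, 1]]
labels = [1, 1, 1, -1, -1, -1]
weights, bias = sgd_logistic(samples, labels)
print([predict(weights, bias, x) for x in samples])
```

Swapping the derivative of the log loss for that of the hinge loss would turn the same loop into a linear SVM trainer, which is why one class can cover both models.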
Logistic regression is a linear model for classification. It's also known in the literature as logit regression, maximum-entropy classification (MaxEnt), or the log-linear classifier. In this model, the probabilities describing the possible outcomes of a single trial are modeled using a logistic function.
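As an optimization problem, binary class L2 penalized logistic regression minimizes the following cost function:

$$\min_{w, c} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)$$

Similarly, binary class L1 regularized logistic regression solves the following optimization problem:

$$\min_{w, c} \|w\|_1 + C \sum_{i=1}^{n} \log\left(\exp\left(-y_i (X_i^T w + c)\right) + 1\right)$$

Here, $w$ is the weight vector, $c$ is the intercept, $y_i \in \{-1, 1\}$ is the label of the sample $X_i$, and $C$ is the inverse of the regularization strength.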
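The two objectives differ only in the penalty term. The following sketch evaluates both for a given weight vector, assuming the formulation used in the scikit-learn documentation: labels in {-1, +1}, the logistic loss summed over all samples and multiplied by C, plus half the squared L2 norm or the L1 norm of the weights. The function names and the toy data are hypothetical:

```python
import math


def logistic_loss(weights, bias, samples, labels):
    """Sum of log(1 + exp(-y * (w . x + c))) over all samples."""
    total = 0.0
    for x, y in zip(samples, labels):
        score = sum(w * v for w, v in zip(weights, x)) + bias
        total += math.log(1.0 + math.exp(-y * score))
    return total


def l2_objective(weights, bias, samples, labels, C=1.0):
    """Half the squared L2 norm of the weights plus C times the logistic loss."""
    penalty = 0.5 * sum(w * w for w in weights)
    return penalty + C * logistic_loss(weights, bias, samples, labels)


def l1_objective(weights, bias, samples, labels, C=1.0):
    """L1 norm of the weights plus C times the logistic loss."""
    penalty = sum(abs(w) for w in weights)
    return penalty + C * logistic_loss(weights, bias, samples, labels)


samples = [[2, 0], [1, 0], [0, 1], [0, 2]]
labels = [1, 1, -1, -1]
w, c = [1.0, -1.0], 0.0
print(l2_objective(w, c, samples, labels))
print(l1_objective(w, c, samples, labels))
```

Because the L1 penalty grows linearly even for small weights, minimizing it tends to push many weights to exactly zero, which acts as a form of feature selection on large vocabularies.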
Support vector machines (SVMs) are among the strongest and most widely used algorithms in the field of machine learning.
SVM is a non-probabilistic classifier. It constructs a hyperplane or a set of hyperplanes in a high-dimensional or infinite-dimensional space, which can be used for classification, regression, or other tasks. Intuitively, a good separation is achieved by a hyperplane that has the largest distance to the nearest training data point of any class (the so-called functional margin), since in general, the larger the margin, the lower the generalization error of the classifier.
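The idea of preferring the hyperplane with the largest distance to the nearest training point can be illustrated with a short sketch. It compares two hand-picked separating hyperplanes on toy 2D points and does not train an SVM. The function name `margin` and all the numbers are hypothetical:

```python
import math


def margin(weights, bias, samples, labels):
    """Smallest signed distance from any training point to the hyperplane w . x + b = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return min(
        y * (sum(w * v for w, v in zip(weights, x)) + bias) / norm
        for x, y in zip(samples, labels)
    )


samples = [(2, 2), (3, 3), (0, 0), (1, 0)]
labels = [1, 1, -1, -1]

# Two hyperplanes that both separate the classes correctly.
margin_a = margin((1.0, 1.0), -2.5, samples, labels)
margin_b = margin((1.0, 0.0), -1.5, samples, labels)
print(round(margin_a, 4), round(margin_b, 4))  # 1.0607 0.5
```

Both hyperplanes classify every training point correctly, but the first leaves about twice as much room around the nearest point, so a maximum-margin classifier would prefer it.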
Let's build one of the most sophisticated supervised learning algorithms with scikit:
>>>from sklearn.svm import LinearSVC
>>>from sklearn.metrics import confusion_matrix, classification_report
>>>svm_classifier = LinearSVC().fit(X_train, y_train)
>>>y_svm_predicted = svm_classifier.predict(X_test)
>>>print(' Here is the classification report:')
>>>print(classification_report(y_test, y_svm_predicted))
>>>print(' confusion_matrix ')
>>>cm = confusion_matrix(y_test, y_svm_predicted)
>>>print(cm)
Here is the classification report for the same:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| ham | 0.99 | 1.00 | 0.99 | 1210 |
| spam | 0.97 | 0.90 | 0.93 | 182 |
| avg / total | 0.98 | 0.98 | 0.98 | 1392 |

confusion_matrix
[[1204 6]
[ 17 165]]
The most informative features:
-0.9657 road            2.3724 txt
-0.7493 mail            2.0720 claim
-0.6701 morning         2.0451 service
-0.6691 home            2.0008 uk
-0.6191 executive       1.7909 150p
-0.5984 said            1.7374 www
-0.5978 lol             1.6997 mobile
-0.5876 kate            1.6736 50
-0.5754 got             1.5882 ringtone
-0.5642 darlin          1.5629 video
-0.5613 fullonsms       1.4816 tone
-0.5613 fullonsms com   1.4237 prize
These results are on par with the SGD classifier and are among the best we have seen from the supervised algorithms we have tried. With this, I will stop with supervised classifiers. There are many books available on the different machine learning algorithms; even for individual algorithms, there are many books available. I would highly recommend that you gain a deep understanding of any of the preceding algorithms before you use them in a real-world application.