In the previous chapters, we discussed models in their most generic form and built a good understanding of them. In this chapter, we will discuss some special cases of Bayesian and Markov networks that are extensively used in real-life problems.
In this chapter, we will be discussing:
The Naive Bayes model is one of the most efficient and effective learning algorithms, particularly in the field of text classification. Although its assumptions are over-simplistic, the model works quite well in practice. In this section, we are going to discuss the following topics:
Before discussing the Naive Bayes model, let's first discuss the Bayesian classifier. A Bayesian classifier is a probabilistic classifier that uses the Bayes theorem to predict a class. Let c be a class and X = {x1, x2, ..., xn} be a set of features. Then, the posterior probability of the class c given the features, that is P(c | X), can be computed using the Bayes theorem as follows:
So, for a given set of features, the output class can be predicted as follows:
Here, P(c) is the prior probability of the class c and P(X | c) is the likelihood of X given c. If X were a univariate feature, then computing P(X | c) would reduce to computing P(x | c), which is easy. However, in the case of multivariate features, P(X | c) is as follows:
The Naive Bayes model simplifies the computation of P(X | c) by making a strong independence assumption over the features.
Fig 7.1 shows the graphical model corresponding to the naive assumption of strong independence among the features. It assumes that any two features xi and xj are conditionally independent of each other given their common parent, the class c. Thus, P(X | c) can be stated as follows:
For example, let's say we want to classify whether a given ball is a tennis ball or a football, and the variables given to us are the diameter of the ball, the color of the ball, and the type of surface. Here, the color, the diameter, and the surface type are clearly independent variables, and the type of ball depends on these three variables, giving the network structure shown in Fig 7.2:
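To make the classification rule concrete, here is a minimal hand-rolled sketch for the ball example. The probability tables are made-up illustrative values, not ones given in the text; the point is only the argmax over P(c) multiplied by the product of the P(xi | c) terms.

```python
# Hand-rolled Naive Bayes for the ball example.
# All probability values below are illustrative assumptions.
prior = {'tennis': 0.5, 'football': 0.5}

# P(feature value | class) for each of the three features
likelihood = {
    'tennis':   {'diameter': {'small': 0.9, 'large': 0.1},
                 'color':    {'yellow': 0.8, 'white': 0.2},
                 'surface':  {'fuzzy': 0.9, 'smooth': 0.1}},
    'football': {'diameter': {'small': 0.1, 'large': 0.9},
                 'color':    {'yellow': 0.2, 'white': 0.8},
                 'surface':  {'fuzzy': 0.1, 'smooth': 0.9}},
}

def classify(observation):
    """Return the class maximizing P(c) * prod_i P(x_i | c)."""
    scores = {}
    for c in prior:
        score = prior[c]
        for feature, value in observation.items():
            score *= likelihood[c][feature][value]
        scores[c] = score
    return max(scores, key=scores.get)

print(classify({'diameter': 'small', 'color': 'yellow',
                'surface': 'fuzzy'}))  # tennis
```

Because of the independence assumption, each conditional probability table involves a single feature, which is what makes the parameters so cheap to estimate.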
Although over-simplistic in its assumption of independence between the features, the Naive Bayes algorithm has performed very well in many cases. Surprisingly, it also performs well in cases where there are strong dependencies between the features or attributes. In this section, we are going to unravel this mystery.
Let's start with a simple binary classification problem where we have to predict the output class c based on the features X. As it is a binary classification, there are only two output classes. For the sake of simplicity, let's assume one class to be a positive class, represented as c+, and the other to be a negative class, represented as c-. One simple explanation is that Naive Bayes owes its good performance to the zero-one loss function, which defines the error as the number of incorrect classifications. Unlike other loss functions, such as squared error, the zero-one loss function does not penalize incorrectly estimated probabilities as long as the maximum probability is assigned to the correct class. For example, for a given set of features X, the actual posterior probability P(c+ | X) might be 0.8 and P(c- | X) might be 0.2, but due to the naive assumption regarding the dependencies between features, Naive Bayes may estimate P(c+ | X) as 0.6 and P(c- | X) as 0.4. Although the probability estimates are incorrect, the predicted class is the same in both cases. Thus, Naive Bayes performs well even in the presence of strong dependencies between features. However, the fundamental question has not yet been answered: why can't the strong dependencies between features flip the classification?
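The zero-one loss argument can be seen in a few lines of Python. The posterior values below are the illustrative numbers from the text; the squared-error computation is added only for contrast:

```python
# Under zero-one loss, only the argmax matters. The true posterior
# and a miscalibrated Naive Bayes estimate are compared below.
true_posterior = {'c+': 0.8, 'c-': 0.2}
naive_estimate = {'c+': 0.6, 'c-': 0.4}

def predict(posterior):
    return max(posterior, key=posterior.get)

# Both assign the maximum probability to the same class, so they
# agree under zero-one loss.
print(predict(true_posterior), predict(naive_estimate))  # c+ c+

# Squared error, by contrast, does penalize the miscalibration.
squared_error = sum((true_posterior[c] - naive_estimate[c]) ** 2
                    for c in true_posterior)
print(round(squared_error, 2))  # 0.08
```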
Before discussing the details, let's introduce a formal definition of the equivalence of two classifiers under the zero-one loss function. Two classifiers f1 and f2 are said to be equal under the zero-one loss function if, for every X in the example space, f1(X) >= 0 if and only if f2(X) >= 0. This is denoted as f1 = f2.
Let's assume that the true graphical model G (as shown in Fig 7.3) represents the dependencies between the features. The probability P(X | c) can then be stated as follows:
Here, pa(xi) represents the parents of xi in G, excluding the class node c.
To measure how strong the dependency between two features is, we have to quantify it. Naturally, the ratio of the conditional probability of a node given its parents, P(x | pa(x), c), over the conditional probability of the node without its parents, P(x | c), reflects how strongly the parents affect the node in each class. This parameter is called the local dependence derivative and is represented as follows:
When x has no parents, the local dependence derivative is defined as 1. When dd+(x | pa(x)) >= 1, x's local dependency supports the class c+; otherwise, it supports the class c-. Similarly, when dd-(x | pa(x)) >= 1, x's local dependency supports the class c-; otherwise, it supports c+. In the case where the local dependence derivative in each class supports the other class, the two partially cancel each other out, and the final classification leans towards the class with the larger local dependence derivative. So, to check which class the local dependency supports overall, we can use the ratio of the local dependence derivatives of the two classes, represented as ddr(x):
If ddr(x) > 1, then x's local dependency supports the class c+. If ddr(x) < 1, then x's local dependency supports the class c-. If ddr(x) = 1, the local dependence distributes evenly in both c+ and c-. Thus, the dependency does not affect the classification, however strong it may be.
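A tiny numeric sketch, with made-up probabilities, shows how equal local dependence derivatives cancel in the ratio:

```python
# Made-up probabilities for a single feature x with one parent,
# illustrating the local dependence derivative and its ratio.
p_x_given_pa_cpos = 0.9   # P(x | pa(x), c+)
p_x_given_cpos = 0.6      # P(x | c+)
p_x_given_pa_cneg = 0.3   # P(x | pa(x), c-)
p_x_given_cneg = 0.2      # P(x | c-)

dd_pos = p_x_given_pa_cpos / p_x_given_cpos   # = 1.5, supports c+
dd_neg = p_x_given_pa_cneg / p_x_given_cneg   # = 1.5, supports c-
ddr = dd_pos / dd_neg

# The dependency is equally strong in both classes, so it cancels
# out and cannot affect the classification.
print(round(ddr, 6))  # 1.0
```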
As stated earlier, for classification using the Bayesian classifier, we use the posterior probability P(c | X). Thus, in the case of binary classification, we can define a variable fb(X) as the ratio of P(c+ | X) over P(c- | X):
If fb(X) >= 1, then the example is classified as belonging to the class c+; otherwise, it is classified as c-. Summarizing all the earlier formulations, fb(X) can be stated as follows:
Here, fnb(X) represents fb(X) in the case of the Naive Bayes model, and the product of the ddr(xi) terms, DF(X), captures the overall effect of the dependencies. Thus, from the earlier equation, we can state that fb = fnb under the zero-one loss function when fnb(X) >= 1 and DF(X) >= 1/fnb(X), or fnb(X) < 1 and DF(X) < 1/fnb(X). Therefore, we can conclude that it is the distribution of the dependencies among the features over the two classes (that is, DF(X)) that affects the classification, not merely the dependencies themselves.
So, in cases where, for each feature x, dd+(x | pa(x)) = dd-(x | pa(x)), that is, the local dependency of each feature distributes evenly across both the positive and negative classes, the Naive Bayes model will perform exactly like the model representing the dependencies among the features. Further, in cases where the ddr(xi) terms cancel out overall, that is, the influence of some local dependencies in favor of the class c+ is canceled by the influence of some other dependencies in favor of the class c-, Naive Bayes will be an optimal classifier as well.
Because of its independence assumption, the parameters for each feature can be learned separately, which greatly simplifies the learning process and is very useful in domains with a very large number of features. In the case of document classification, the features or attributes of a document are nothing but the words comprising it. In most cases, the vocabulary is huge, leading to a very large number of features. This is why Naive Bayes is one of the major algorithms used for document classification.
There are two variants of the Naive Bayes model that are generally used for document classification.
One model specifies that a document is represented by a vector of binary attributes indicating which words occur and which words do not occur in the document. The attributes are independent of the number of times a word occurs in a document. So, the computation of the probability of a document involves multiplication of the probabilities of all the attribute values, including the probability of non-occurrence for words that do not occur in the document. Here, we consider the document to be the event, and the absence or presence of words to be attributes of the event. This describes a distribution based on a multivariate Bernoulli event model.
The second model specifies that a document is represented by the set of word occurrences in the document. In this model, however, the number of occurrences of each word in the document is captured. So, computing the probability of a document involves multiplication of the probability of the words that occur. Here we consider the individual word occurrences to be the events and the document to be a collection of word events. We call this a multinomial event model.
As stated earlier, in the case of the multivariate Bernoulli Naive Bayes model, a document is considered a binary vector over the space of words of a given vocabulary V. The document d can be represented as d = (b1, b2, ..., b|V|), where bi corresponds to the presence of the word wi in the document: bi = 1 if wi is present, and bi = 0 otherwise. More formally, bi is an indicator variable representing the presence of the i-th word of the vocabulary in the document d.
Thus, P(d | c) is defined as follows:
We can see that a document is considered a collection of multiple independent Bernoulli experiments, one for each word in the vocabulary, with the probability of each of these word events given by the per-class parameter P(wi | c).
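A minimal sketch of this computation, with a made-up four-word vocabulary and made-up per-class word probabilities, looks as follows. Note that the product includes a (1 - P(wi | c)) factor for every vocabulary word that is absent from the document:

```python
import numpy as np

# P(d | c) under the multivariate Bernoulli event model: one
# Bernoulli term per vocabulary word, whether or not it occurs.
# Vocabulary and probabilities are illustrative assumptions.
vocab = ['ball', 'game', 'space', 'orbit']
p_word_given_c = np.array([0.8, 0.6, 0.05, 0.02])  # P(w_i = 1 | c)

d = np.array([1, 0, 0, 0])  # the document contains only 'ball'

# P(d | c) = prod_i P(w_i|c)^b_i * (1 - P(w_i|c))^(1 - b_i)
p_d_given_c = np.prod(p_word_given_c ** d *
                      (1 - p_word_given_c) ** (1 - d))
print(round(p_d_given_c, 4))  # 0.2979
```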
The scikit-learn Python module provides us with an implementation of the Naive Bayes model. Let's look at an example of text classification. Before going into the details of classification, let's discuss one of the major steps in text classification, known as feature extraction.
The most common strategy to extract features from a text document is called the bag-of-words representation. In this representation, documents are described by word occurrences, completely ignoring the relative positions of the words in the document. The scikit-learn Python module provides utilities for the most common ways of extracting numerical features from text content, which include the following:
Let's discuss the various feature extraction methods implemented in scikit-learn with examples. The first feature extractor we will discuss is CountVectorizer. It implements both tokenization and counting of occurrences:
In [1]: from sklearn.feature_extraction.text import CountVectorizer

# The input parameter min_df is a threshold used to ignore terms
# whose document frequency is less than the threshold. By default
# it is set to 1.
In [2]: vectorizer = CountVectorizer(min_df=1)

In [3]: corpus = ['This is the first document.',
   ...:           'This is the second second document.',
   ...:           'And the third one.',
   ...:           'Is this the first document?']

# The fit_transform method learns the vocabulary dictionary and
# returns the term-document matrix.
In [4]: X = vectorizer.fit_transform(corpus)

# Each term found by the analyzer during the fit is assigned a
# unique integer index corresponding to a column in the resulting
# matrix.
In [5]: print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# The numerical features can be extracted with the toarray method.
# It returns a matrix of shape (n_documents, n_features) whose
# columns correspond to vectorizer.get_feature_names(). The value
# of X[i, j] is the count of the word corresponding to column j in
# document i.
In [6]: X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

# Instead of using counts, we can also get a binary value matrix
# for the given corpus by setting the binary parameter to True.
In [7]: vectorizer_binary = CountVectorizer(min_df=1, binary=True)

In [8]: X_binary = vectorizer_binary.fit_transform(corpus)

# X_binary[i, j] == 1 means that the word corresponding to
# column j is present in document i.
In [9]: X_binary.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
Another interesting feature extractor is tf-idf, which is short for "term frequency-inverse document frequency". In a large text corpus, some words, such as "the", "a", and "is", are present in almost all the documents, and hence carry very little meaningful information about the actual contents of a document. If we fed the raw count data directly to a classifier, these very frequent terms would overshadow the frequencies of rarer yet more interesting terms. Tf-idf diminishes the importance of words that occur in the majority of documents. It is the product of two terms, namely term frequency and inverse document frequency. Term frequency corresponds to the frequency of a term in a document, that is, the number of times the term t appears in the document d. Inverse document frequency is a measure of how much information a word provides, that is, whether the term is common or rare across all documents. Let's look at an example of tf-idf using scikit-learn:
In [1]: from sklearn.feature_extraction.text import TfidfVectorizer

In [2]: corpus = ['This is the first document.',
   ...:           'This is the second second document.',
   ...:           'And the third one.',
   ...:           'Is this the first document?']

# The input parameter min_df is a threshold used to ignore terms
# whose document frequency is less than the threshold. By default
# it is set to 1.
In [3]: vectorizer = TfidfVectorizer(min_df=1)

# The fit_transform method learns the vocabulary dictionary and
# returns the term-document matrix.
In [4]: X = vectorizer.fit_transform(corpus)

# Each term found by the analyzer during the fit is assigned a
# unique integer index corresponding to a column in the resulting
# matrix.
In [5]: print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# The numerical features can be extracted with the toarray method.
# It returns a matrix of shape (n_documents, n_features) whose
# columns correspond to vectorizer.get_feature_names(). The value
# of X[i, j] is the tf-idf weight of the word corresponding to
# column j in document i.
In [6]: print(X.toarray())
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]
 [ 0.          0.27230147  0.          0.27230147  0.          0.85322574
   0.22262429  0.          0.27230147]
 [ 0.55280532  0.          0.          0.          0.55280532  0.
   0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.
   0.35872874  0.          0.43877674]]
There are other feature extractor implementations in scikit-learn, such as HashingVectorizer, which uses the hashing trick to map string token names to feature indices. It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts. As it uses a scipy.sparse matrix and keeps no vocabulary dictionary in memory, it is very memory efficient and can be used with large collections of text documents.
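A short sketch of HashingVectorizer in action; the corpus and the n_features value are arbitrary choices for illustration:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# HashingVectorizer maps each token to a column index with a hash
# function, so it never stores a vocabulary. n_features (here an
# arbitrary 2**10) fixes the width of the resulting sparse matrix.
vectorizer = HashingVectorizer(n_features=2 ** 10)

# No fit is needed: the hash function is stateless, so transform
# alone is enough.
X = vectorizer.transform(['This is the first document.',
                          'And the third one.'])

print(X.shape)  # (2, 1024)
```

The statelessness of the hash function is exactly why the classification example below can call transform without fit when this vectorizer is selected.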
Let's come back to our discussion on the implementation of text classification using the multivariate Bernoulli Naive Bayes model:
In [1]: from sklearn.datasets import fetch_20newsgroups

In [2]: from sklearn.feature_extraction.text import HashingVectorizer

In [3]: from sklearn.feature_extraction.text import CountVectorizer

In [4]: from sklearn.naive_bayes import BernoulliNB

In [5]: from sklearn import metrics

# The dataset used in this example is the 20 newsgroups dataset, a
# collection of approximately 20,000 newsgroup documents,
# partitioned (nearly) evenly across 20 different newsgroups. It
# will be automatically downloaded, then cached.
# For our simple example, we are only going to use 4 newsgroups.
In [6]: categories = ['alt.atheism', 'talk.religion.misc',
   ...:               'comp.graphics', 'sci.space']

# Loading training data
In [7]: data_train = fetch_20newsgroups(subset='train',
   ...:                                 categories=categories,
   ...:                                 shuffle=True,
   ...:                                 random_state=42)

# Loading test data
In [8]: data_test = fetch_20newsgroups(subset='test',
   ...:                                categories=categories,
   ...:                                shuffle=True,
   ...:                                random_state=42)

In [9]: y_train, y_test = data_train.target, data_test.target

# It can be changed to "count" if we want to use CountVectorizer
In [10]: feature_extractor_type = "hashed"

In [11]: if feature_extractor_type == "hashed":
    ...:     # To convert the text documents into numerical
    ...:     # features, we need a feature extractor. Here we use
    ...:     # HashingVectorizer, as it is memory efficient for
    ...:     # large datasets.
    ...:     vectorizer = HashingVectorizer(stop_words='english')
    ...:     # In the case of HashingVectorizer, we don't need to
    ...:     # fit the data; transform alone works.
    ...:     X_train = vectorizer.transform(data_train.data)
    ...:     X_test = vectorizer.transform(data_test.data)
    ...: elif feature_extractor_type == "count":
    ...:     # The other vectorizer we can use is CountVectorizer
    ...:     # with binary=True, but for CountVectorizer we need
    ...:     # to fit over both the training and test data, as it
    ...:     # requires the complete vocabulary to create the
    ...:     # matrix.
    ...:     vectorizer = CountVectorizer(stop_words='english',
    ...:                                  binary=True)
    ...:     # First fit the data
    ...:     vectorizer.fit(data_train.data + data_test.data)
    ...:     # Then transform it
    ...:     X_train = vectorizer.transform(data_train.data)
    ...:     X_test = vectorizer.transform(data_test.data)

# alpha is the additive (Laplace/Lidstone) smoothing parameter
# (0 for no smoothing).
In [12]: clf = BernoulliNB(alpha=.01)

# Training the classifier
In [13]: clf.fit(X_train, y_train)

# Predicting results
In [14]: y_predicted = clf.predict(X_test)

In [15]: score = metrics.accuracy_score(y_test, y_predicted)

In [16]: print("accuracy: %0.3f" % score)
In the previous section, we discussed the multivariate Bernoulli Naive Bayes model. In this section, we are going to discuss another variant called the multinomial model. Unlike the previous model, it captures the word frequency information of a document.
In the multinomial model, a document is considered an ordered sequence of word events drawn from the same vocabulary V. Again, we make a similar Naive Bayes assumption: the probability of each word event in a document is independent of the word's context and position in the document. Thus, each document d is drawn from a multinomial distribution over words, with as many independent trials as the length of d. The distribution is parameterized by a vector θc = (θc1, ..., θc|V|) for each class c, where |V| is the size of the vocabulary and θci represents the probability of the word wi occurring in a document of class c, that is, P(wi | c).
The parameter θci is estimated by a smoothed version of the maximum likelihood estimate as follows:
Here, Nci is defined as the number of times the word wi appears in the samples of class c in the training set T, that is, Nci = Σ over d in T[c] of count(wi, d), where count(wi, d) represents the number of occurrences of wi in the document d and T[c] represents all the samples of the training set T belonging to the class c. Nc is defined as the total count of all features for class c, that is, Nc = Σi Nci.
The smoothing parameter α accounts for features not present in the learning samples. It prevents the assignment of zero probability to words that do not appear in a particular class in the training set. Setting α = 1 is called Laplace smoothing, while α < 1 is called Lidstone smoothing.
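The smoothed estimate can be sketched in a few lines of NumPy; the counts and the four-word vocabulary below are made up for illustration:

```python
import numpy as np

# Sketch of the smoothed estimate
#     theta_ci = (N_ci + alpha) / (N_c + alpha * |V|)
# for one class. Two of the words never occur in this class.
N_ci = np.array([3, 1, 0, 0])  # per-word counts for class c
alpha = 1.0                    # Laplace smoothing
V = len(N_ci)                  # vocabulary size

theta = (N_ci + alpha) / (N_ci.sum() + alpha * V)
print(theta)  # unseen words still get a small non-zero probability

# The estimates form a valid probability distribution over words.
assert np.isclose(theta.sum(), 1.0)
```

With alpha = 0, the last two words would receive probability zero, and any document containing them would be assigned zero likelihood for this class regardless of its other words.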
Its implementation in Python is as follows:
In [1]: from sklearn.datasets import fetch_20newsgroups

In [2]: from sklearn.feature_extraction.text import TfidfVectorizer

In [3]: from sklearn.feature_extraction.text import CountVectorizer

In [4]: from sklearn.naive_bayes import MultinomialNB

In [5]: from sklearn import metrics

# Just like in the previous example, we are going to use the
# 20 newsgroups data.
In [6]: categories = ['alt.atheism', 'talk.religion.misc',
   ...:               'comp.graphics', 'sci.space']

# Loading training data
In [7]: data_train = fetch_20newsgroups(subset='train',
   ...:                                 categories=categories,
   ...:                                 shuffle=True,
   ...:                                 random_state=42)

# Loading test data
In [8]: data_test = fetch_20newsgroups(subset='test',
   ...:                                categories=categories,
   ...:                                shuffle=True,
   ...:                                random_state=42)

In [9]: y_train, y_test = data_train.target, data_test.target

In [10]: feature_extractor_type = "tfidf"

In [11]: if feature_extractor_type == "count":
    ...:     # The other vectorizer we can use is CountVectorizer,
    ...:     # but for CountVectorizer we need to fit over both
    ...:     # the training and test data, as it requires the
    ...:     # complete vocabulary to create the matrix.
    ...:     vectorizer = CountVectorizer(stop_words='english')
    ...:     vectorizer.fit(data_train.data + data_test.data)
    ...:     X_train = vectorizer.transform(data_train.data)
    ...:     X_test = vectorizer.transform(data_test.data)
    ...: elif feature_extractor_type == "tfidf":
    ...:     vectorizer = TfidfVectorizer(stop_words="english")
    ...:     X_train = vectorizer.fit_transform(data_train.data)
    ...:     X_test = vectorizer.transform(data_test.data)

# alpha is the additive (Laplace/Lidstone) smoothing parameter
# (0 for no smoothing).
In [12]: clf = MultinomialNB(alpha=.01)

# Training the classifier
In [13]: clf.fit(X_train, y_train)

# Predicting results
In [14]: y_predicted = clf.predict(X_test)

In [15]: score = metrics.accuracy_score(y_test, y_predicted)

In [16]: print("accuracy: %0.3f" % score)
In the previous sections, we discussed two different variants of the Naive Bayes model, the multivariate Bernoulli model and the multinomial model. There has been a lot of research on which model to choose. McCallum and Nigam (1998) did an extensive comparison of the two models (refer to the research paper titled A Comparison of Event Models for Naive Bayes Text Classification). They found that the multivariate Bernoulli model performs well with small vocabulary sizes, but the multinomial model usually performs better at larger vocabulary sizes, providing, on average, a 27 percent reduction in error over the multivariate Bernoulli model at any vocabulary size. However, it is advisable to evaluate both models.