Chapter 7. Specialized Models

In the previous chapters, we discussed general Bayesian and Markov models, and we now have a good understanding of them. In this chapter, we will discuss some special cases of Bayesian and Markov networks that are extensively used in real-life problems.

In this chapter, we will be discussing:

  • The Naive Bayes model
  • Dynamic Bayesian networks
  • The Hidden Markov model

The Naive Bayes model

The Naive Bayes model is one of the most efficient and effective learning algorithms, particularly in the field of text classification. Although its independence assumption is over-simplistic, the model works quite well in practice. In this section, we are going to discuss the following topics:

  • What is a Naive Bayes model?
  • Why does it even work?
  • Types of Naive Bayes models

Before discussing the Naive Bayes model, let's first discuss the Bayesian classifier. A Bayesian classifier is a probabilistic classifier that uses the Bayes theorem to predict a class. Let c be a class and $X = \{x_1, x_2, \ldots, x_n\}$ be a set of features. Then, the probability of the features belonging to class c, that is $P(c \mid x_1, x_2, \ldots, x_n)$, can be computed using the Bayes theorem as follows:

$$P(c \mid X) = \frac{P(X \mid c) \, P(c)}{P(X)}$$

So, for a given set of features, the output class can be predicted as follows:

$$\hat{c} = \arg\max_{c} P(c) \, P(X \mid c)$$

Here, P(c) is the prior probability of the class c and $P(X \mid c)$ is the likelihood of X given c. If X were a univariate feature, computing $P(X \mid c)$ would reduce to computing $P(x_1 \mid c)$, which is easy. However, in the case of multivariate features, $P(X \mid c)$ is as follows:

$$P(X \mid c) = P(x_1, x_2, \ldots, x_n \mid c)$$

The Naive Bayes model simplifies the computation of $P(X \mid c)$ by taking a strong independence assumption over the features.


Fig 7.1: Graphical model representing the Naive Bayes model

Fig 7.1 shows the graphical model corresponding to the naive assumption of strong independence among the features. It assumes that any two features $x_i$ and $x_j$ are conditionally independent of each other given their parent c (that is, $x_i \perp x_j \mid c$). Thus, $P(X \mid c)$ can be stated as follows:

$$P(X \mid c) = \prod_{i=1}^{n} P(x_i \mid c)$$

For example, let's say we want to classify whether a given ball is a tennis ball or a football, and the variables given to us are the diameter of the ball, the color of the ball, and the type of surface. Here, the diameter, the color, and the surface type are clearly independent variables, and the type of ball depends on these three variables, giving the network structure shown in Fig 7.2:


Fig 7.2: Network structure for the ball classification example
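To make this concrete, here is a minimal sketch of how such a network can be used for prediction. The prior and the conditional probability values below are purely hypothetical numbers chosen for illustration; only the arg max computation reflects the Naive Bayes classification rule discussed above:

import numpy as np

# Hypothetical prior over the two classes of ball.
prior = {'tennis_ball': 0.5, 'football': 0.5}

# Hypothetical conditional probabilities P(feature value | class).
cpd = {
    'tennis_ball': {'diameter': {'small': 0.9, 'large': 0.1},
                    'color':    {'yellow': 0.8, 'white': 0.2},
                    'surface':  {'fuzzy': 0.9, 'smooth': 0.1}},
    'football':    {'diameter': {'small': 0.1, 'large': 0.9},
                    'color':    {'yellow': 0.1, 'white': 0.9},
                    'surface':  {'fuzzy': 0.2, 'smooth': 0.8}},
}

# Observed features of the ball we want to classify.
observation = {'diameter': 'small', 'color': 'yellow', 'surface': 'fuzzy'}

# Naive Bayes prediction: argmax over c of P(c) * prod_i P(x_i | c).
scores = {c: prior[c] * np.prod([cpd[c][f][v]
                                 for f, v in observation.items()])
          for c in prior}
print(max(scores, key=scores.get))   # tennis_ball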

Why does it even work?

Although its assumption about the dependencies between the features is over-simplistic, the Naive Bayes algorithm performs very well in many cases. Surprisingly, it also performs well in cases where there are strong dependencies between the features or attributes. In this section, we are going to unravel this mystery.

Let's start with a simple binary classification problem where we have to predict the output class c based on the features X. As it is binary classification, there are only two output classes. For the sake of simplicity, let's assume one class to be the positive class, represented as $c^+$, and the other to be the negative class, represented as $c^-$. One simple explanation is that Naive Bayes owes its good performance to the zero-one loss function, which defines the error as the number of incorrect classifications. Unlike other loss functions, such as the squared error, the zero-one loss function does not penalize inaccuracy in the estimated probabilities as long as the maximum probability is assigned to the correct class. For example, for a given set of features X, the actual posterior probability $P(c^+ \mid X)$ might be 0.8 and $P(c^- \mid X)$ might be 0.2, but due to the naive assumption about the dependencies between the features, Naive Bayes may estimate $P(c^+ \mid X)$ as 0.6 and $P(c^- \mid X)$ as 0.4. Although the probability estimates are incorrect, the predicted class is the same in both cases. Thus, Naive Bayes can perform well even in the presence of strong dependencies between the features. However, the fundamental question has not yet been answered: why can't the strong dependencies between the features flip the classification?

Before discussing the details, let's introduce the formal definition of the equivalence of two classifiers under the zero-one loss function. Here, a binary classifier is represented as a function f(X) that assigns the example to the positive class when $f(X) \geq 1$ and to the negative class otherwise. Two classifiers $f_1$ and $f_2$ are said to be equal under the zero-one loss function if, for every X in the example space, $f_1(X) \geq 1$ if and only if $f_2(X) \geq 1$. This is denoted as $f_1 = f_2$.

Let's assume that the true graphical model $G$ (as shown in Fig 7.3) represents the dependencies between the features. The probability $P(X \mid c)$ can then be stated as follows:

$$P(X \mid c) = \prod_{i=1}^{n} P(x_i \mid pa(x_i), c)$$

Here, $pa(x_i)$ represents the parents of $x_i$ in $G$, excluding the class node c.


Fig 7.3: Graphical model representing strong dependencies between the features

To measure how strong the dependency between two features is, we have to quantify it. Naturally, the ratio of the conditional probability of a node given its parents, $P(x \mid pa(x), c)$, over the conditional probability of the node without its parents, $P(x \mid c)$, reflects how strongly the parents affect the node in each class. This quantity is called the local dependence derivative and is represented as follows:

$$dd^+(x \mid pa(x)) = \frac{P(x \mid pa(x), c^+)}{P(x \mid c^+)}$$
$$dd^-(x \mid pa(x)) = \frac{P(x \mid pa(x), c^-)}{P(x \mid c^-)}$$

When x has no parents, $dd^+(x \mid pa(x))$ and $dd^-(x \mid pa(x))$ are defined as 1. When $dd^+(x \mid pa(x)) \geq 1$, x's local dependency supports class $c^+$; otherwise, it supports class $c^-$. Similarly, when $dd^-(x \mid pa(x)) \geq 1$, x's local dependency supports class $c^-$; otherwise, it supports class $c^+$. When the local dependence derivatives in the two classes support different classes, they partially cancel each other out, and the final classification leans toward the class with the larger derivative. So, to check which class the local dependency supports overall, we can use the ratio of the local dependence derivatives of the two classes, represented as $ddr(x \mid pa(x))$:

$$ddr(x \mid pa(x)) = \frac{dd^+(x \mid pa(x))}{dd^-(x \mid pa(x))}$$

If $ddr(x \mid pa(x)) > 1$, then x's local dependency supports class $c^+$. If $ddr(x \mid pa(x)) < 1$, then x's local dependency supports class $c^-$. If $ddr(x \mid pa(x)) = 1$, the local dependence is distributed evenly between $c^+$ and $c^-$. Thus, in this case the dependency does not affect the classification, however strong it may be.

As stated earlier, for classification using the Bayesian classifier, we use $\arg\max_c P(c) P(X \mid c)$. Thus, in the case of binary classification, we can define a function $f_b(X)$ as the ratio of $P(c^+ \mid X)$ over $P(c^- \mid X)$:

$$f_b(X) = \frac{P(c^+ \mid X)}{P(c^- \mid X)}$$

If $f_b(X) \geq 1$, then the example is classified as belonging to the class $c^+$, else to $c^-$. Summarizing all the earlier formulations, $f_b(X)$ can be stated as follows:

$$f_b(X) = \left( \prod_{i=1}^{n} ddr(x_i \mid pa(x_i)) \right) f_{nb}(X)$$

Here, $f_{nb}(X)$ represents $f_b(X)$ in the case of the Naive Bayes model. Thus, from the earlier equation, we can state that $f_b = f_{nb}$ under the zero-one loss function when $f_b(X) \geq 1$ and $\prod_{i=1}^{n} ddr(x_i \mid pa(x_i)) \leq f_b(X)$, or when $f_b(X) < 1$ and $\prod_{i=1}^{n} ddr(x_i \mid pa(x_i)) > f_b(X)$. Therefore, we can conclude that it is the distribution of the dependencies between the features over the two classes (that is, $\prod_{i=1}^{n} ddr(x_i \mid pa(x_i))$) that affects the classification, not merely the dependencies themselves.

So, in cases where, for each feature, $dd^+(x \mid pa(x)) = dd^-(x \mid pa(x))$, that is, $ddr(x \mid pa(x)) = 1$, meaning the local dependence of each feature is distributed evenly across both the positive and negative classes, the Naive Bayes model will perform in the same way as the model representing the dependencies among the features. Further, in cases where $\prod_{i=1}^{n} ddr(x_i \mid pa(x_i)) = 1$, that is, where the influence of some local dependencies in favor of class $c^+$ is canceled out by the influence of other dependencies in favor of class $c^-$, Naive Bayes will also be an optimal classifier.
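To see these quantities in action, here is a minimal numeric sketch. The probabilities are made up purely for illustration: two binary features are used, where $x_2$ depends on $x_1$ given the class, and we verify that $f_b(X) = ddr(x_2 \mid x_1) \cdot f_{nb}(X)$ for the observation $x_1 = 1$, $x_2 = 1$ (since $x_1$ has no parent, its ddr is 1):

import numpy as np

# Hypothetical binary-class example; all probabilities are made up
# purely for illustration.
p_c = {'+': 0.5, '-': 0.5}                  # class priors
p_x1 = {'+': 0.8, '-': 0.3}                 # P(x1 = 1 | c)
p_x2_given_x1 = {'+': {1: 0.9, 0: 0.2},     # P(x2 = 1 | x1, c)
                 '-': {1: 0.4, 0: 0.7}}

# Marginal P(x2 = 1 | c) that Naive Bayes uses (the parent x1 is
# averaged out).
p_x2 = {c: p_x2_given_x1[c][1] * p_x1[c]
        + p_x2_given_x1[c][0] * (1 - p_x1[c]) for c in p_c}

# f_b uses the true model, f_nb uses the Naive Bayes factorization.
f_b = (p_c['+'] * p_x1['+'] * p_x2_given_x1['+'][1]) / \
      (p_c['-'] * p_x1['-'] * p_x2_given_x1['-'][1])
f_nb = (p_c['+'] * p_x1['+'] * p_x2['+']) / \
       (p_c['-'] * p_x1['-'] * p_x2['-'])

# Local dependence derivatives of x2 and their ratio.
dd_plus = p_x2_given_x1['+'][1] / p_x2['+']
dd_minus = p_x2_given_x1['-'][1] / p_x2['-']
ddr = dd_plus / dd_minus

print(np.isclose(f_b, ddr * f_nb))   # True: f_b = ddr * f_nb
print(f_b >= 1, f_nb >= 1)           # both classify the example as c+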

Because of its independence assumption, the parameters of each feature can be learned separately, which greatly simplifies the learning process and is very useful in domains with a large number of features. In document classification, the features or attributes of a document are simply the words comprising it. In most cases, the vocabulary is huge, leading to a very large number of features. This is why Naive Bayes is one of the major algorithms used for document classification.

Types of Naive Bayes models

There are two variants of the Naive Bayes model that are generally used for document classification.

One model specifies that a document is represented by a vector of binary attributes indicating which words occur and which do not occur in the document. The attributes do not capture the number of times a word occurs in a document. So, computing the probability of a document involves multiplying the probabilities of all the attribute values, including the probability of non-occurrence for words that do not occur in the document. Here, we consider the document to be the event, and the absence or presence of words to be the attributes of the event. This describes a distribution based on a multivariate Bernoulli event model.

The second model specifies that a document is represented by the set of word occurrences in the document. In this model, however, the number of occurrences of each word in the document is captured. So, computing the probability of a document involves multiplication of the probability of the words that occur. Here we consider the individual word occurrences to be the events and the document to be a collection of word events. We call this a multinomial event model.

Multivariate Bernoulli Naive Bayes model

As stated earlier, in the case of the multivariate Bernoulli Naive Bayes model, a document is considered as a binary vector over the space of words of a given vocabulary V. The document d can be represented as $d = (b_1, b_2, \ldots, b_{|V|})$, where $b_t$ corresponds to the presence of the word $w_t$ in the document; $b_t = 1$ if $w_t$ is present and $b_t = 0$ otherwise. In other words, $b_t$ is an indicator variable representing the presence of the word $w_t$ in the document d.

Thus, $P(d \mid c)$ is defined as follows:

$$P(d \mid c) = \prod_{t=1}^{|V|} \left( b_t P(w_t \mid c) + (1 - b_t)\left(1 - P(w_t \mid c)\right) \right)$$

We can see that a document is considered as a collection of multiple independent Bernoulli experiments, one for each word in the vocabulary, with the probabilities for each of these word events given by the components $P(w_t \mid c)$.
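As a minimal sketch of this computation (the vocabulary and the word probabilities $P(w_t \mid c)$ below are hypothetical numbers chosen purely for illustration), note that the product runs over the whole vocabulary, including the words that are absent from the document:

import numpy as np

# Hypothetical vocabulary and per-class word presence probabilities
# P(w_t | c); the numbers are made up purely for illustration.
vocabulary = ['ball', 'game', 'space', 'rocket']
p_word_given_c = {'sports': np.array([0.9, 0.8, 0.1, 0.05]),
                  'space':  np.array([0.2, 0.1, 0.9, 0.7])}

# Binary representation of a document: b_t = 1 if word w_t is present.
d = np.array([1, 0, 0, 1])   # the document contains 'ball' and 'rocket'

def bernoulli_likelihood(d, p):
    # P(d | c) = prod_t (b_t * P(w_t|c) + (1 - b_t) * (1 - P(w_t|c)))
    return np.prod(d * p + (1 - d) * (1 - p))

for c, p in p_word_given_c.items():
    print(c, bernoulli_likelihood(d, p))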

The scikit-learn Python module provides us with the implementation of the Naive Bayes model. Let's look at an example of text classification. Before going into the details of classification, let's discuss one of the major steps in text classification known as feature extraction.

The most common strategy to extract features from a text document is called the bag-of-words representation. In this representation, documents are described by word occurrences while completely ignoring the relative position of the words in the document. The scikit-learn Python module provides utilities for the most common ways of extracting numerical features from text content, which include the following:

  • Tokenizing strings and giving an integer ID for each possible token
  • Counting the occurrences of tokens in each document
  • Normalizing and weighting with diminishing importance of tokens that occur in the majority of samples/documents

Let's discuss the various feature extraction methods implemented in scikit-learn with examples. The first feature extractor we will discuss is CountVectorizer. It implements both tokenization and counting occurrences:

In [1]: from sklearn.feature_extraction.text import CountVectorizer
# The input parameter min_df is a threshold used to ignore terms
# with a document frequency strictly lower than the threshold.
# By default it is set to 1.
In [2]: vectorizer = CountVectorizer(min_df=1)

In [3]: corpus = ['This is the first document.',
                  'This is the second second document.',
                  'And the third one.',
                  'Is this the first document?']

# The fit_transform method learns the vocabulary dictionary
# and returns the term-document matrix.
In [4]: X = vectorizer.fit_transform(corpus)

# Each term found by the analyzer during the fit is assigned a 
# unique integer index corresponding to a column in the resulting 
# matrix.
In [5]: print(vectorizer.get_feature_names())
   ['and', 'document', 'first', 'is', 'one', 'second', 'the',
    'third', 'this']

# The numerical features can be extracted using the toarray method.
# It returns a matrix of shape (n_documents, n_features).
# The columns correspond to vectorizer.get_feature_names(). The
# value at position [i, j] is the count of the word corresponding
# to column j in document i.
In [6]: print(X.toarray())
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])
 
# Instead of using the counts, we can also get a binary-valued
# matrix for the given corpus by setting the binary parameter
# to True.
In [7]: vectorizer_binary = CountVectorizer(min_df=1, binary=True)

In [8]: X_binary = vectorizer_binary.fit_transform(corpus)
# The value of a[i, j] == 1 means that the word corresponding to 
# column j is present in document i
In [9]: print(X_binary.toarray())
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]])

Another interesting feature extractor is called tf-idf, which is short for "term frequency-inverse document frequency". In a large text corpus, some words, such as "the", "a", and "is", are present in almost all the documents and hence carry very little meaningful information about the actual contents of a document. If we were to feed the raw count data directly to a classifier, these very frequent terms would shadow the frequencies of rarer yet more interesting terms. Tf-idf diminishes the importance of such words that occur in the majority of documents. It is the product of two terms, namely the term frequency and the inverse document frequency. The term frequency corresponds to the frequency of a term in a document, that is, the number of times the term t appears in the document d. The inverse document frequency is a measure of how much information the word provides, that is, whether the term is common or rare across all documents. Let's take an example of tf-idf using scikit-learn:

In [1]: from sklearn.feature_extraction.text import TfidfVectorizer

In [2]: corpus = ['This is the first document.',
                  'This is the second second document.',
                  'And the third one.',
                  'Is this the first document?']

# The input parameter min_df is a threshold used to ignore terms
# with a document frequency strictly lower than the threshold.
# By default it is set to 1.
In [3]: vectorizer = TfidfVectorizer(min_df=1)

# The fit_transform method learns the vocabulary dictionary
# and returns the term-document matrix.
In [4]: X = vectorizer.fit_transform(corpus)

# Each term found by the analyzer during the fit is assigned a 
# unique integer index corresponding to a column in the resulting 
# matrix.
In [5]: print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the',
 'third', 'this']

# The numerical features can be extracted using the toarray method.
# It returns a matrix of shape (n_documents, n_features).
# The columns correspond to vectorizer.get_feature_names(). The
# value at position [i, j] is the tf-idf weight of the word
# corresponding to column j in document i.
In [6]: print(X.toarray())
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.
  0.35872874  0.          0.43877674]
[ 0.          0.27230147  0.          0.27230147  0.          0.85322574
  0.22262429  0.          0.27230147]
[ 0.55280532  0.          0.          0.          0.55280532  0.
  0.28847675  0.55280532  0.        ]
[ 0.          0.43877674  0.54197657  0.43877674  0.          0.
  0.35872874  0.          0.43877674]]
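The exact numbers above depend on scikit-learn's defaults for TfidfVectorizer, so they are best read as implementation details: with the default smooth_idf=True, the inverse document frequency of a term t is computed as shown below, each matrix entry is the term count multiplied by this idf, and every row is then normalized to unit Euclidean (L2) length.

$$\text{idf}(t) = \ln\!\left(\frac{1 + n_d}{1 + \text{df}(t)}\right) + 1$$

Here, $n_d$ is the total number of documents and $\text{df}(t)$ is the number of documents containing the term t. For example, "the" occurs in all four documents, so its idf is $\ln(5/5) + 1 = 1$, the smallest possible value, which is why "the" receives the smallest weight in every row of the matrix above.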

There are other implementations of feature extractors in scikit-learn, such as HashingVectorizer, which uses the hashing trick to create a mapping from the string token name to the feature index. It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts. As it uses the scipy.sparse matrix, it is very memory efficient and can be used in the case of large text documents.

Let's come back to our discussion on the implementation of text classification using the multivariate Bernoulli Naive Bayes model:

In [1]: from sklearn.datasets import fetch_20newsgroups
In [2]: from sklearn.feature_extraction.text import HashingVectorizer
In [3]: from sklearn.feature_extraction.text import CountVectorizer
In [4]: from sklearn.naive_bayes import BernoulliNB
In [5]: from sklearn import metrics

# The dataset used in this example is the 20 newsgroups dataset.
# The 20 Newsgroups data set is a collection of
# approximately 20,000 newsgroup documents, partitioned (nearly)
# evenly across 20 different newsgroups. It will be
# automatically downloaded, then cached.

# For our simple example we are only going to use 4 newsgroups
In [6]: categories = ['alt.atheism',
                      'talk.religion.misc',
                      'comp.graphics',
                      'sci.space']

# Loading training data
In [7]: data_train = fetch_20newsgroups(subset='train', 
                                        categories=categories, 
                                        shuffle=True, 
                                        random_state=42)

# Loading test data
In [8]: data_test = fetch_20newsgroups(subset='test',                      
                                       categories=categories,
                                       shuffle=True, 
                                       random_state=42)

In [9]: y_train, y_test = data_train.target, data_test.target

# It can be changed to "count" if we want to use count vectorizer
In [10]: feature_extractor_type = "hashed"

In [11]: if feature_extractor_type == "hashed":
             # To convert the text documents into numerical features,
             # we need to use a feature extractor. In this example we
             # are using HashingVectorizer, as it is memory
             # efficient in the case of large datasets.
             vectorizer = HashingVectorizer(stop_words='english')

             # In the case of HashingVectorizer we don't need to fit
             # the data; transform alone is enough.
             X_train = vectorizer.transform(data_train.data)
             X_test = vectorizer.transform(data_test.data)

         elif feature_extractor_type == "count":
             # The other vectorizer we can use is CountVectorizer with
             # binary=True. But for CountVectorizer we need to fit
             # over both the training and test data, as it requires
             # the complete vocabulary to create the matrix.
             vectorizer = CountVectorizer(stop_words='english',
                                          binary=True)

# First fit the data (for HashingVectorizer, fit is a no-op, so
# running it in both cases is harmless)
In [12]: vectorizer.fit(data_train.data + data_test.data)

# Then transform it
In [13]: X_train = vectorizer.transform(data_train.data)
In [14]: X_test = vectorizer.transform(data_test.data)

# alpha is additive (Laplace/Lidstone) smoothing parameter (0 for 
# no smoothing).
In [15]: clf = BernoulliNB(alpha=.01)

# Training the classifier
In [16]: clf.fit(X_train, y_train)

# Predicting results
In [17]: y_predicted = clf.predict(X_test)

In [18]: score = metrics.accuracy_score(y_test, y_predicted)
In [19]: print("accuracy: %0.3f" % score)

Multinomial Naive Bayes model

In the previous section, we discussed the multivariate Bernoulli Naive Bayes model. In this section, we are going to discuss another variant called the multinomial model. Unlike the previous model, it captures the word frequency information of a document.

In the multinomial model, a document is considered to be an ordered sequence of word events drawn from the same vocabulary V. Again, we make a similar Naive Bayes assumption that the probability of each word event in a document is independent of the word's context and position in the document. Thus, each document $d_i$ is drawn from a multinomial distribution over words with as many independent trials as the length of $d_i$. The distribution is parameterized by a vector $\theta_c = (\theta_{c1}, \ldots, \theta_{c|V|})$ for each class c, where |V| is the size of the vocabulary and $\theta_{ct}$ represents the probability of the word $w_t$ occurring in class c, that is $P(w_t \mid c)$.
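To make this concrete, the resulting class-conditional likelihood of a document d whose word counts are $f_{d1}, \ldots, f_{d|V|}$ can be written as follows (the multinomial coefficient is dropped, as it does not depend on the class and therefore does not affect classification):

$$P(d \mid c) \propto \prod_{t=1}^{|V|} \theta_{ct}^{\, f_{dt}}$$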

The parameter $\theta_{ct}$ is estimated using a smoothed version of the maximum likelihood estimate as follows:

$$\hat{\theta}_{ct} = \frac{N_{ct} + \alpha}{N_c + \alpha |V|}$$

Here, $N_{ct} = \sum_{d \in T[c]} f_{dt}$ is the number of times the word $w_t$ appears in the samples of class c in the training set T, where $f_{dt}$ represents the count of the word $w_t$ in the document d and T[c] represents all the samples of the training set T belonging to the class c. $N_c = \sum_{t=1}^{|V|} N_{ct}$ is the total count of all the words for class c.

The smoothing parameter $\alpha \geq 0$ accounts for words not present in the learning samples. It prevents the assignment of zero probabilities to words that do not occur in a particular class. Setting $\alpha = 1$ is called Laplace smoothing, while $\alpha < 1$ is called Lidstone smoothing.
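The following is a minimal sketch of this estimation on a tiny hypothetical term-count matrix (the documents, vocabulary, and counts are made up purely for illustration), using $\alpha = 1$:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical term-count matrix: rows are documents, columns are the
# words of a toy vocabulary V = ['ball', 'game', 'space', 'rocket'].
X = np.array([[2, 1, 0, 0],    # document 0, class 0
              [1, 2, 0, 0],    # document 1, class 0
              [0, 0, 3, 1],    # document 2, class 1
              [0, 1, 1, 2]])   # document 3, class 1
y = np.array([0, 0, 1, 1])
alpha = 1.0                    # Laplace smoothing

n_classes, n_words = len(np.unique(y)), X.shape[1]
theta = np.zeros((n_classes, n_words))
for c in range(n_classes):
    N_ct = X[y == c].sum(axis=0)   # per-word counts N_ct for class c
    N_c = N_ct.sum()               # total word count N_c for class c
    theta[c] = (N_ct + alpha) / (N_c + alpha * n_words)
print(theta)                       # each row sums to 1

# The same estimates are exposed by scikit-learn's MultinomialNB via
# the feature_log_prob_ attribute (in log space).
clf = MultinomialNB(alpha=alpha).fit(X, y)
print(np.allclose(np.exp(clf.feature_log_prob_), theta))   # True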

Its implementation in Python is as follows:

In [1]: from sklearn.datasets import fetch_20newsgroups
In [2]: from sklearn.feature_extraction.text import TfidfVectorizer
In [3]: from sklearn.feature_extraction.text import CountVectorizer
In [4]: from sklearn.naive_bayes import MultinomialNB
In [5]: from sklearn import metrics

# Just like the previous example, here also we are going to use the
# 20 newsgroups data.

In [6]: categories = ['alt.atheism',
                      'talk.religion.misc',
                      'comp.graphics',
                      'sci.space']

# Loading training data
In [7]: data_train = fetch_20newsgroups(subset='train',
                                        categories=categories, 
                                        shuffle=True, 
                                        random_state=42)

# Loading test data
In [8]: data_test = fetch_20newsgroups(subset='test',
                                       categories=categories,
                                       shuffle=True, 
                                       random_state=42)

In [9]: y_train, y_test = data_train.target, data_test.target
In [10]: feature_extractor_type = "tfidf"

In [11]: if feature_extractor_type == "count":
             # The other vectorizer we can use is CountVectorizer.
             # But for CountVectorizer we need to fit over both the
             # training and test data, as it requires the complete
             # vocabulary to create the matrix.
             vectorizer = CountVectorizer(stop_words='english')
             vectorizer.fit(data_train.data + data_test.data)
             X_train = vectorizer.transform(data_train.data)
             X_test = vectorizer.transform(data_test.data)

         elif feature_extractor_type == "tfidf":
             vectorizer = TfidfVectorizer(stop_words="english")
             X_train = vectorizer.fit_transform(data_train.data)
             X_test = vectorizer.transform(data_test.data)

# alpha is additive (Laplace/Lidstone) smoothing parameter (0 for 
# no smoothing).
In [12]: clf = MultinomialNB(alpha=.01)

# Training the classifier
In [13]: clf.fit(X_train, y_train)

# Predicting results
In [14]: y_predicted = clf.predict(X_test)

In [15]: score = metrics.accuracy_score(y_test, y_predicted)
In [16]: print("accuracy: %0.3f" % score)

Choosing the right model

In the previous sections, we discussed two different variants of the Naive Bayes model: the multivariate Bernoulli model and the multinomial model. There has been a lot of research on which model to choose. McCallum and Nigam (1998) did an extensive comparison of the two models (refer to the research paper titled A Comparison of Event Models for Naive Bayes Text Classification). They found that the multivariate Bernoulli model performs well with small vocabulary sizes, but the multinomial model usually performs better at larger vocabulary sizes, providing, on average, a 27 percent reduction in error over the multivariate Bernoulli model at any vocabulary size. However, it is advisable to evaluate both models.
