4.4. Naïve Bayesian

The data mining algorithms used for classification tasks are quite diverse. The objective of these algorithms is the same—prediction of a target variable. But, the method of predicting is drawn from a range of multidisciplinary techniques. The naïve Bayes algorithm finds its roots in statistics and probability theory. In general, classification techniques try to predict class labels based on attributes by best approximating the relationship between input and output variables. Every day, we mentally estimate a myriad of outcomes based on past evidence. Consider the process of guessing commuting time to work. First, commute time depends heavily on when you are traveling. If you are traveling during peak hours, most likely the commute is going to be longer. Weather conditions like rain, snow, or dense fog will slow down the commute. If the day is a school holiday, like summer break, then the commute will be lighter than on school days. If there is any planned road work ahead, the commute usually takes longer. When more than one adverse factor is at play, then the commute will be even longer than if it is just one isolated factor. All this if-then knowledge is based on previous experience of commuting when one or more factors come into play. Our experience creates a model in our brain and we mentally run the model before pulling out of the driveway!
Let’s take the case of defaults in home mortgages and assume the average overall default rate is 2%. The likelihood of an average person defaulting on their mortgage loan is 2%. However, if a given individual’s credit history is above average (or excellent), then the likelihood of their default would be less than average. Furthermore, if we know that the person’s annual salary is above average with respect to loan value, then the likelihood of default falls further. As we obtain more evidence on the factors that impact the outcome, we can make improved guesses about the outcome using probability theory. The naïve Bayesian algorithm basically leverages the probabilistic relationship between the factors (attributes) and the class label (outcome). The algorithm makes a strong and sometimes naïve assumption of independence between the attributes, thus its name. The independence assumption between attributes may not always hold true. In some cases, we can assume annual income and credit score are independent of each other. However, in many cases we just don’t know. If one of the factors for the default rate is home value, then we have a scenario where both the annual income and home value factors are correlated and thus not independent. Homeowners with high income tend to buy more expensive houses. The independence assumption doesn’t hold true always, but the simplicity and robustness of the algorithm offsets the limitation introduced by the independence assumption.
Predicting and Filtering Spam Email
Spam is unsolicited bulk email sent to a large number of email users. At best it is an annoyance to recipients, but many spam emails hide a malicious intent by hosting false advertisements or redirecting clicks to phishing sites. Filtering spam email is one of the essential features provided by email service providers and administrators. The key challenge is balancing the risk of incorrectly flagging a legitimate email as spam (a false positive) against the risk of not catching all the spam messages. There is no perfect spam filtering solution and spam detection is a catch-up game: spammers constantly try to deceive and outsmart the spam filters, and email administrators fortify the filters against new spam scenarios. Automated spam filtering based on algorithms provides a promising solution for containing spam and a learning framework for updating the filtering solutions (Process Software, 2013).
Some words occur in spam emails more often than in legitimate email messages. For example, the probability of occurrence of words like free, mortgage, credit, sale, Viagra, etc. is higher in spam emails than in normal emails. The exact probabilities can be calculated if we have a sample of previously known spam emails and regular emails. Based on the known word probabilities, the overall probability of an email being spam can be computed from all the words in the email and the probability of each word appearing in spam versus regular emails. This is the foundation of Bayesian spam filtering systems (Zdziarski, 2005). Any wrongly classified spam message that is subsequently reclassified by the user is an opportunity to refine the model, making spam filtering adaptive to new spam techniques. Though recent spam reduction uses a combination of different algorithms, Bayesian-based spam filtering remains one of the foundational elements of spam prediction systems (Sahami et al., 1998).
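To make the idea concrete, here is a minimal sketch of how per-word probabilities can be combined into a spam score under the independence assumption. The word probabilities, priors, and function name below are made up for illustration; they are not from any real filter or corpus.

```python
from math import log, exp

# Hypothetical per-word probabilities learned from previously labeled mail
# (illustrative numbers only, not from a real corpus).
spam_word_prob = {"free": 0.30, "mortgage": 0.20, "credit": 0.25, "meeting": 0.02}
ham_word_prob = {"free": 0.03, "mortgage": 0.01, "credit": 0.02, "meeting": 0.15}
p_spam, p_ham = 0.40, 0.60  # assumed prior proportions of spam vs. legitimate mail

def spam_score(words):
    """Posterior probability of spam given the words, assuming word independence."""
    log_spam, log_ham = log(p_spam), log(p_ham)
    for w in words:
        if w in spam_word_prob:  # skip words with no learned statistics
            log_spam += log(spam_word_prob[w])
            log_ham += log(ham_word_prob[w])
    return exp(log_spam) / (exp(log_spam) + exp(log_ham))

print(spam_score(["free", "credit"]))   # close to 1: likely spam
print(spam_score(["meeting"]))          # close to 0: likely legitimate
```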

4.4.1. How it Works

The naïve Bayesian algorithm is built on Bayes’ theorem, named after Reverend Thomas Bayes. Bayes’ work is described in “Essay Towards Solving a Problem in the Doctrine of Chances” (1763), published posthumously in the Philosophical Transactions of the Royal Society of London by Richard Price. Bayes’ theorem is one of the most influential and important concepts in statistics and probability theory. It provides a mathematical expression for how a degree of subjective belief changes to account for new evidence. First, let’s discuss the terminology used in Bayes’ theorem.
Assume X is the evidence (or factors or attribute set) and Y is the outcome (or target or label class). Here X is a set, not individual attributes, hence X = {X1, X2, X3, …, Xn}, where Xi is an individual attribute, such as credit rating. The probability of outcome P(Y) is called prior probability, which can be calculated from the data set. Prior probability shows the likelihood of an outcome in a given data set. For example, in the mortgage case, P(Y) is the default rate of a home mortgage, which is 2%. P(Y|X) is called the conditional probability, which provides the probability of an outcome given the evidence when we know the value of X. Again, using the mortgage example, P(Y|X) is the average rate of default given that an individual’s credit history is known. If the credit history is excellent, then the probability of default is likely to be less than 2%. P(Y|X) is also called posterior probability. Calculating posterior probability is the objective of predictive analytics using Bayes’ theorem. This is the likelihood of an outcome as we learn the values of the input attributes.
Bayes’ theorem states that
$$P(Y|X) = \frac{P(Y)\,P(X|Y)}{P(X)} \qquad (4.13)$$
P(X|Y) is another conditional probability, called the class conditional probability. P(X|Y) is the probability that an attribute assumes a particular value given the class label. Like P(Y), P(X|Y) can be calculated from the data set as well. If we know the training set of loan defaults, we can calculate the probability of an “excellent” credit rating given that the default is a “yes.” As indicated in Bayes’ theorem, class conditional probability is crucial in calculating posterior probability. P(X) is basically the probability of the evidence. In the mortgage example, this is simply the proportion of individuals with a given credit rating. To classify a new record, we can compute P(Y|X) for each class of Y and see which probability “wins.” Class label Y with the highest value of P(Y|X) wins for a particular attribute value X. Since P(X) is the same for every class value of the outcome, we don’t have to calculate this and assume it as a constant. More generally, in an example set with n attributes X = {X1, X2, X3 … Xn},
$$P(Y|X) = \frac{P(Y)\,\prod_{i=1}^{n} P(X_i|Y)}{P(X)} \qquad (4.14)$$
If we know how to calculate the class conditional probability P(X|Y), or equivalently the product ∏ P(Xi|Y), then it is easy to calculate the posterior probability P(Y|X). Since P(X) is constant for every value of Y, it is enough to calculate the numerator P(Y) ∗ ∏ P(Xi|Y) for every class value.
To further explain how the naïve Bayesian algorithm works, let's use the modified Golf data set shown in Table 4.4. The Golf table is an artificial data set with four attributes and one class label. Note that the nominal version of the data is used for easier explanation (Temperature and Humidity have been converted from the numeric type). In Bayesian terms, the weather conditions are the evidence and the decision to play or not to play is the belief. Altogether there are 14 examples, with five examples of Play = no and nine examples of Play = yes. The objective is to predict whether the player will Play (yes or no), given information about a few weather-related measures, based on learning from the data set in Table 4.4. Here is a step-by-step explanation of how the Bayesian model works.

Step 1: Calculating Prior Probability P(Y)

Prior probability P(Y) is the probability of an outcome. In this example set there are two possible outcomes: Play = yes and Play = no. From Table 4.4, out of 14 records there are 5 records with the "no" class and 9 records with the "yes" class. The probability of each outcome is
P(Y = no) = 5/14
P(Y = yes) = 9/14

Table 4.4

Golf Data Set with Modified Temperature and Humidity Attributes

No. | Temperature (X1) | Humidity (X2) | Outlook (X3) | Wind (X4) | Play (Class Label) Y
1 | high | med | sunny | false | no
2 | high | high | sunny | true | no
3 | low | low | rain | true | no
4 | med | high | sunny | false | no
5 | low | med | rain | true | no
6 | high | med | overcast | false | yes
7 | low | high | rain | false | yes
8 | low | med | rain | false | yes
9 | low | low | overcast | true | yes
10 | low | low | sunny | false | yes
11 | med | med | rain | false | yes
12 | med | low | sunny | true | yes
13 | med | high | overcast | true | yes
14 | high | low | overcast | false | yes


Since the probability of an outcome is calculated from the data set, it is important that the data set used for data mining is representative of the population if sampling is used. A class-stratified sample of the population would distort these prior probabilities and is therefore not suitable for naïve Bayesian modeling.
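As a minimal sketch (not part of the book's RapidMiner workflow), the prior probabilities for the Golf data in Table 4.4 can be obtained by simple counting:

```python
from collections import Counter

# Class labels of the 14 records in Table 4.4
play = ["no"] * 5 + ["yes"] * 9

counts = Counter(play)                                           # Counter({'yes': 9, 'no': 5})
priors = {label: n / len(play) for label, n in counts.items()}
print(priors)                                                    # {'no': 0.357..., 'yes': 0.642...}, i.e., 5/14 and 9/14
```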

Step 2: Calculating Class Conditional Probability P(Xi|Y)

Class conditional probability is the probability of each attribute value for an attribute, for each outcome value. This calculation is repeated for all the attributes: Temperature (X1), Humidity (X2), Outlook (X3), and Wind (X4), and for every distinct outcome value. Let's calculate the class conditional probability of Temperature (X1). For each value of the Temperature attribute, we can calculate P(X1|Y = no) and P(X1|Y = yes) by constructing a probability table as shown in Table 4.5. From the data set there are five Y = no records and nine Y = yes records. Out of the five Y = no records, we can also calculate the probability of occurrence when the temperature is high, medium, and low. The values are 2/5, 1/5, and 2/5, respectively. We can repeat the same process for the outcome Y = yes.
Similarly, we can repeat the calculation to find the class conditional probabilities for the other three attributes: Humidity (X2), Outlook (X3), and Wind (X4). These class conditional probabilities are shown in Table 4.6.

Table 4.5

Class Conditional Probability of Temperature

Temperature (X1) | P(X1|Y = no) | P(X1|Y = yes)
high | 2/5 | 2/9
med | 1/5 | 3/9
low | 2/5 | 4/9

Table 4.6

Conditional Probability of Humidity, Outlook, and Wind

Humidity (X2) | P(X2|Y = no) | P(X2|Y = yes)
high | 2/5 | 2/9
low | 1/5 | 4/9
med | 2/5 | 3/9

Outlook (X3) | P(X3|Y = no) | P(X3|Y = yes)
overcast | 0/5 | 4/9
rain | 2/5 | 3/9
sunny | 3/5 | 2/9

Wind (X4) | P(X4|Y = no) | P(X4|Y = yes)
false | 2/5 | 6/9
true | 3/5 | 3/9
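The same counting yields the class conditional probabilities. Below is a short sketch that reproduces Table 4.5 from the Temperature column of Table 4.4; the attribute values are typed in directly, and the sketch is an illustration rather than the book's implementation:

```python
from collections import Counter, defaultdict

# (Temperature, Play) pairs for the 14 records of Table 4.4
records = [("high", "no"), ("high", "no"), ("low", "no"), ("med", "no"), ("low", "no"),
           ("high", "yes"), ("low", "yes"), ("low", "yes"), ("low", "yes"), ("low", "yes"),
           ("med", "yes"), ("med", "yes"), ("med", "yes"), ("high", "yes")]

class_counts = Counter(label for _, label in records)     # {'no': 5, 'yes': 9}
value_counts = defaultdict(Counter)
for value, label in records:
    value_counts[label][value] += 1

# P(Temperature = value | Play = label), as in Table 4.5
for label in ("no", "yes"):
    for value in ("high", "med", "low"):
        p = value_counts[label][value] / class_counts[label]
        print(f"P(Temperature={value} | Play={label}) = {p:.3f}")
```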

Table 4.7

Test Record

No. | Temperature (X1) | Humidity (X2) | Outlook (X3) | Wind (X4) | Play (Class Label) Y
Unlabeled test | high | low | sunny | false | ?


Step 3: Predicting the Outcome Using Bayes’ Theorem

The class conditional probability tables are now prepared and can be used for the prediction task. If a new, unlabeled test record (Table 4.7) has the attribute values Temperature = high, Humidity = low, Outlook = sunny, and Wind = false, what would be the class label prediction: Play = yes or Play = no? The outcome class can be predicted based on Bayes' theorem by calculating the posterior probability P(Y|X) for both values of Y. Once P(Y = yes|X) and P(Y = no|X) are calculated, the predicted outcome is simply the class with the higher posterior probability. While calculating both posterior probabilities using Equation 4.14, it is sufficient to calculate just the numerator P(Y) ∗ ∏ P(Xi|Y), as P(X) is going to be the same for both outcome classes.
P(Y = yes|X) = P(Y = yes) ∗ {∏ P(Xi|Y = yes)}/P(X)
= P(Y = yes) ∗ {P(Temp = high|Y = yes) ∗ P(Humidity = low|Y = yes) ∗ P(Outlook = sunny| Y = yes) ∗ P(Wind = false|Y = yes)}/P(X)
= 9/14 ∗ {2/9 ∗ 4/9 ∗ 2/9 ∗ 6/9}/P(X)
= 0.0094/P(X)
P(Y = no|X) = P(Y = no) ∗ {P(Temp = high|Y = no) ∗ P(Humidity = low|Y = no) ∗ P(Outlook = sunny|Y = no) ∗ P(Wind = false|Y = no)}/P(X)
= 5/14 ∗ {2/5 ∗ 1/5 ∗ 3/5 ∗ 2/5}/P(X)
= 0.0069/P(X)
We normalize both estimates by dividing each by (0.0094 + 0.0069) to get
Likelihood of (Play = yes) = 0.0094/(0.0094 + 0.0069) = 58%
Likelihood of (Play = no) = 0.0069/(0.0094 + 0.0069) = 42%
In this case P(Y = yes|X) > P(Y = no|X), hence the prediction for the unlabeled test record will be Play = yes.
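Putting Steps 1 through 3 together, the short sketch below scores the unlabeled test record of Table 4.7 using the lookup tables from Tables 4.5 and 4.6 (typed in directly for brevity; an illustration, not a production implementation):

```python
# Priors and class conditional probabilities from Tables 4.4 to 4.6
priors = {"yes": 9 / 14, "no": 5 / 14}
cond = {
    "yes": {"Temperature": {"high": 2/9, "med": 3/9, "low": 4/9},
            "Humidity":    {"high": 2/9, "low": 4/9, "med": 3/9},
            "Outlook":     {"overcast": 4/9, "rain": 3/9, "sunny": 2/9},
            "Wind":        {"false": 6/9, "true": 3/9}},
    "no":  {"Temperature": {"high": 2/5, "med": 1/5, "low": 2/5},
            "Humidity":    {"high": 2/5, "low": 1/5, "med": 2/5},
            "Outlook":     {"overcast": 0/5, "rain": 2/5, "sunny": 3/5},
            "Wind":        {"false": 2/5, "true": 3/5}},
}

test = {"Temperature": "high", "Humidity": "low", "Outlook": "sunny", "Wind": "false"}

# Numerator of Equation 4.14 for each class; P(X) cancels out in the normalization
scores = {}
for label in priors:
    score = priors[label]
    for attribute, value in test.items():
        score *= cond[label][attribute][value]
    scores[label] = score

total = sum(scores.values())
for label, score in scores.items():
    print(label, round(score, 4), f"{score / total:.0%}")   # yes 0.0094 58%, no 0.0069 42%
```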
Bayesian modeling is relatively simple to understand, once you get past the notation, and easy to implement in practically any programming language. The computation for model building is quite simple and involves the creation of a lookup table of probabilities. Bayesian modeling is also quite robust in handling missing values. If a test record does not contain a value for an attribute, say Temperature is missing, the Bayesian model simply omits the corresponding class conditional probability for all the outcomes. Missing values in the test set would be far more difficult to handle in decision trees and regression algorithms, particularly when the missing attribute is used high up in the nodes of a decision tree or carries more weight in a regression. Even though the naïve Bayes algorithm is quite robust to missing attributes, it does have a few limitations. Here are a couple of the most significant limitations, along with methods of mitigation.

Issue 1: Incomplete Training Set

Problems arise when an attribute value in a testing record has no examples in the training data. In the Golf data set (Table 4.4), if a test record has the attribute value Outlook = overcast, the class conditional probability P(Outlook = overcast|Y = no) is zero. If even one of the attribute's class conditional probabilities is zero, then by the nature of the Bayesian equation the entire posterior probability becomes zero.
P(Y = no|X) = P(Y = No) ∗ {P(Temp = high|Y = no) ∗ P(Humidity = low| Y = no) ∗ P(Outlook = overcast|Y = no) ∗ P(Wind = false|Y = no)}/P(X)
= 5/14 ∗ {2/5 ∗ 1/5 ∗ 0 ∗ 2/5}/P(X)
= 0
In this case P(Y = yes|X) > P(Y = no|X), and the test example will be classified as Play = yes. If, hypothetically, some other attribute value (say Temperature = low) also had no training records for the outcome yes, then both posterior probabilities, P(Y = no|X) and P(Y = yes|X), would be zero and an arbitrary prediction would have to be made to resolve the dilemma.
To mitigate this problem, we can assign small default probabilities for the missing combinations instead of zero. With this, the absence of an attribute value does not wipe out the value of P(X|Y), although it reduces the probability to a small number. This technique is called Laplace correction. Laplace correction adds a controlled error to all the class conditional probabilities.
In the data set above, if a test record contains Outlook = overcast, then P(Outlook = overcast|Y = no) = 0. The class conditional probabilities for the three values of Outlook given Y = no are 0/5, 2/5, and 3/5. We can add a controlled error by adding 1 to every numerator and 3 to every denominator, so the corrected class conditional probabilities become 1/8, 3/8, and 4/8. The sum of all the class conditional probabilities is still 1. Generically, the Laplace correction is given by

$$\text{Corrected probability } P(X_i|Y = \text{no}) = \frac{0 + \mu p_1}{5 + \mu},\; \frac{2 + \mu p_2}{5 + \mu},\; \frac{3 + \mu p_3}{5 + \mu} \qquad (4.15)$$

where p1 + p2 + p3 = 1 and μ is the correction.
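The correction in the example above (1/8, 3/8, and 4/8) corresponds to μ = 3 with uniform p1 = p2 = p3 = 1/3, which reduces to the familiar add-one smoothing. A small sketch, with the choice of μ assumed purely for illustration:

```python
def laplace_corrected(counts, mu=3.0):
    """Laplace-corrected class conditional probabilities for one attribute and one class.

    counts: dict of attribute value -> observed count (e.g., Outlook counts for Play = no)
    mu:     correction strength; each p_i is taken as uniform, 1/len(counts)
    """
    total = sum(counts.values())
    k = len(counts)
    return {value: (n + mu / k) / (total + mu) for value, n in counts.items()}

# Outlook counts among the five Play = no records of Table 4.4
outlook_no = {"overcast": 0, "rain": 2, "sunny": 3}
print(laplace_corrected(outlook_no))   # {'overcast': 0.125, 'rain': 0.375, 'sunny': 0.5} -> 1/8, 3/8, 4/8
```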

Issue 2: Continuous Attributes

If an attribute has continuous numeric values instead of nominal values, the solution discussed above will not work. We can always convert the continuous values to nominal values by discretization and use the same approach as discussed. But discretization requires exercising subjective judgment on the bucketing range, leading to loss of information. Instead, we can preserve the continuous values as such and use the probability density function. We assume the probability distribution for a numerical attribute follows a normal or Gaussian distribution. If the attribute value is known to follow some other distribution, such as Poisson, the equivalent probability density function can be used. The probability density function for a normal distribution is given by

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (4.16)$$

where μ is the mean and σ is the standard deviation of the sample.
In the updated Golf data set shown in Table 4.8, temperature and humidity are continuous attributes. In such a situation, we compute the mean and standard deviation for both class labels (Play = yes and Play = no) for temperature and humidity (Table 4.9).
If an unlabeled test record has a Humidity value of 78, we can compute the probability density using Equation 4.16 for both outcomes. For the outcome Play = yes, if we plug the values x = 78, μ = 73, and σ = 6.16 into the probability density function, the equation renders the value 0.04. Similarly, for the outcome Play = no, we can plug in x = 78, μ = 74.6, and σ = 7.89 and compute the probability density to obtain 0.05:

Table 4.8

Golf Data Set with Continuous Attributes

No. | Outlook (X1) | Humidity (X2) | Temperature (X3) | Wind (X4) | Play (Y)
1 | sunny | 85 | 85 | false | no
2 | sunny | 80 | 90 | true | no
6 | rain | 65 | 70 | true | no
8 | sunny | 72 | 95 | false | no
14 | rain | 71 | 80 | true | no
3 | overcast | 83 | 78 | false | yes
4 | rain | 70 | 96 | false | yes
5 | rain | 68 | 80 | false | yes
7 | overcast | 64 | 65 | true | yes
9 | sunny | 69 | 70 | false | yes
10 | rain | 75 | 80 | false | yes
11 | sunny | 75 | 70 | true | yes
12 | overcast | 72 | 90 | true | yes
13 | overcast | 81 | 75 | false | yes


Table 4.9

Mean and Deviation for Continuous Attributes

Play Value | Statistic | Humidity (X2) | Temperature (X3)
Y = no | Mean | 74.60 | 84.00
Y = no | Deviation | 7.89 | 9.62
Y = yes | Mean | 73.00 | 78.22
Y = yes | Deviation | 6.16 | 9.88


P(Humidity = 78|Y = yes) = 0.04
P(Humidity = 78|Y = no) = 0.05
The above values are probability densities and not probabilities. On a continuous scale, the probability of an attribute being exactly equal to a particular value is zero. Instead, the probability is computed for a range, such as humidity values from 77.5 to 78.5. Since the same range would be used when computing the probability density for both outcomes, Play = yes and Play = no, it is not necessary to compute the actual probability. Hence we can substitute these density values into Equation 4.14 in place of the class conditional probability P(X|Y).
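The densities can be checked by plugging the numbers into Equation 4.16, as in the sketch below; it uses the means and deviations of Table 4.9, and the printed full-precision values may differ slightly from the rounded figures quoted above:

```python
import math

def gaussian_pdf(x, mean, sd):
    """Probability density of a normal distribution (Equation 4.16)."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (math.sqrt(2 * math.pi) * sd)

# Humidity = 78, with means and deviations taken from Table 4.9
print(gaussian_pdf(78, mean=73.00, sd=6.16))   # density for Play = yes
print(gaussian_pdf(78, mean=74.60, sd=7.89))   # density for Play = no
```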

Issue 3: Attribute Independence

One of the fundamental assumptions in the naïve Bayesian model is attribute independence. The factorization used in Equation 4.14, where the class conditional probability is the product of the individual P(Xi|Y) terms, holds only when the attributes are independent of each other given the class. In many real-life cases this is quite a stringent condition, which is why the technique is called "naïve" Bayesian. However, in practice the naïve Bayesian model works fine even with slightly correlated features (Rish, 2001). We can handle this problem by preprocessing the data: before applying the naïve Bayesian algorithm, it makes sense to remove strongly correlated attributes. In the case of all-numeric attributes, this can be achieved by computing a weighted correlation matrix. An advanced application of Bayes' theorem, called a Bayesian belief network, is designed to handle data sets with attribute dependencies.
The independence of two categorical (nominal) attributes can be tested by the chi-square (χ2) test for independence. The chi-square test can be calculated by creating a contingency table of observed frequency like the one shown in Table 4.10A. A contingency table is a simple cross tab of two attributes under consideration.

Table 4.10

Contingency Tables with Observed Frequency (A) and Expected Frequency (B)

(A) Observed Frequency
Outlook | Wind = false | Wind = true | Total
overcast | 2 | 2 | 4
rain | 3 | 2 | 5
sunny | 3 | 2 | 5
Total | 8 | 6 | 14

(B) Expected Frequency
Outlook | Wind = false | Wind = true | Total
overcast | 2.29 | 1.71 | 4
rain | 2.86 | 2.14 | 5
sunny | 2.86 | 2.14 | 5
Total | 8 | 6 | 14


A contingency table of expected frequency (Table 4.10B) can also be created based on the following equation:

$$E_{r,c} = \frac{(\text{row total}) \times (\text{column total})}{\text{table total}} \qquad (4.17)$$

The chi-square statistic (χ2) aggregates the differences between the observed and expected frequency tables. χ2 is calculated by Equation 4.18, where O is the observed frequency and E is the expected frequency:

$$\chi^2 = \sum \frac{(O - E)^2}{E} \qquad (4.18)$$

If the chi-square statistic (χ2) is less than the critical value obtained from the chi-square distribution for a given confidence level, then we fail to reject the hypothesis of independence and, for practical purposes, we can treat the two variables under consideration as independent. This test can be performed in statistical tools or in Microsoft Excel.
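For reference, the same test takes only a few lines of code. The sketch below assumes SciPy is available; the computation could equally be done by hand with Equations 4.17 and 4.18 or in a spreadsheet:

```python
from scipy.stats import chi2_contingency

# Observed frequencies of Outlook vs. Wind from Table 4.10A (rows: overcast, rain, sunny)
observed = [[2, 2],
            [3, 2],
            [3, 2]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value, dof)
print(expected)   # matches the expected frequencies of Table 4.10B
# A large p-value (far above 0.05) means independence of Outlook and Wind cannot be rejected.
```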

4.4.2. How to Implement

The naïve Bayesian model is one of the few data mining techniques that can be easily implemented in almost any programming language. Since the conditional probability tables can be prepared in the model building phase, the execution of the model at runtime is very quick. Data mining tools have dedicated naïve Bayes classifier functions. In RapidMiner, the Naïve Bayes operator is available under Modeling > Classification. The process of building a model and applying it to new data is similar to that of decision trees and other classifiers. The naïve Bayesian algorithm can accept both numeric and nominal attributes.

Step 1: Data Preparation

The Golf data set shown in Table 4.8 is available in RapidMiner under Sample > Data in the repository section. The Golf data set can be dragged and dropped into the process area to source all 14 records of the data set. Within the same repository folder, there is also a Golf-Test data set with 14 records used for testing. Both data sets need to be added to the main process area. Since the Naïve Bayes operator accepts numeric and nominal data types, no other data transformation is necessary. Sampling is a common method of extracting a training data set from a larger data set. For naïve Bayesian modeling it is especially important that the training data set be representative of, and proportional to, the underlying data set. Hence, random sampling is recommended instead of class-stratified sampling techniques.

Step 2: Modeling Operator and Parameters

The Naïve Bayes operator (Modeling > Classification) can now be connected to the Golf training data set. The operator has only one parameter option to set: whether or not to include Laplace correction. For smaller data sets, Laplace correction is strongly encouraged, as a data set may not have all combinations of attribute values for every class value; in fact, Laplace correction is checked by default. The outputs of the Naïve Bayes operator are the model and the original training data set. The model output should be connected to the Apply Model operator (in the Model Application folder) to execute the model on the test data set. The output of the Apply Model operator is the labeled test data set and the model.

Step 3: Evaluation

The labeled data set that we have after using the Apply Model operator is then connected to the Performance – Classification operator to evaluate the performance of the classification model. The Performance – Classification operator can be found under Evaluation > Performance Measurement > Performance. Figure 4.35 shows the complete naïve Bayesian predictive classification process. The output ports can be connected to result ports and the process can be saved and executed. The RapidMiner process is also available for download from the companion website www.LearnPredictiveAnalytics.com.

Step 4: Execution and Interpretation

The process shown in Figure 4.35 has three result outputs: a model description, a performance vector, and a labeled data set. The labeled data set contains the test data set with the predicted class added as a new column. The labeled data set also contains the confidence for each label class, which indicates the likelihood of each label class value.
The model description result contains more information on the class conditional probabilities of all the input attributes, derived from the training data set. The Charts tab in the model description contains the probability density functions for the attributes, as shown in Figure 4.36. For continuous attributes such as Humidity, we can discern the decision boundaries between the class labels; we can see that when Humidity exceeds 82, the likelihood of Play = no increases. The Distribution Table tab in the model description, shown in Figure 4.37, provides the familiar class conditional probability table with all attribute values and the corresponding probability measures, similar to Table 4.5 and Table 4.6.
Figure 4.35 Data mining process for Naive Bayes algorithm.
Figure 4.36 Naive Bayes model output: Probability density function for Humidity attribute.
Figure 4.37 Naive Bayes distribution table output.
The performance vector output is similar to previously discussed classification algorithms. The performance vector provides the confusion matrix describing accuracy, precision, and recall metrics for the predicted test data set.
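For readers who prefer code to a GUI workflow, the same train, apply, and evaluate flow can be approximated outside RapidMiner. The sketch below uses scikit-learn's CategoricalNB on the nominal Golf data of Table 4.4; it is an analogue for illustration, not the book's RapidMiner process, and in practice the separate Golf-Test data set would be supplied to the evaluation step instead of the training data.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder
from sklearn.metrics import accuracy_score, confusion_matrix

# Nominal Golf training data from Table 4.4: Temperature, Humidity, Outlook, Wind
X_train = [["high", "med", "sunny", "false"], ["high", "high", "sunny", "true"],
           ["low", "low", "rain", "true"], ["med", "high", "sunny", "false"],
           ["low", "med", "rain", "true"], ["high", "med", "overcast", "false"],
           ["low", "high", "rain", "false"], ["low", "med", "rain", "false"],
           ["low", "low", "overcast", "true"], ["low", "low", "sunny", "false"],
           ["med", "med", "rain", "false"], ["med", "low", "sunny", "true"],
           ["med", "high", "overcast", "true"], ["high", "low", "overcast", "false"]]
y_train = ["no"] * 5 + ["yes"] * 9

encoder = OrdinalEncoder()               # map nominal values to integer codes
X_enc = encoder.fit_transform(X_train)

model = CategoricalNB(alpha=1.0)         # alpha > 0 applies Laplace-style smoothing
model.fit(X_enc, y_train)

# Apply the model to the unlabeled test record of Table 4.7
X_test = encoder.transform([["high", "low", "sunny", "false"]])
print(model.predict(X_test), model.predict_proba(X_test))

# Evaluate on labeled data (here the training set itself, for brevity)
y_pred = model.predict(X_enc)
print(accuracy_score(y_train, y_pred))
print(confusion_matrix(y_train, y_pred))
```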

4.4.3. Conclusion

The Bayesian algorithm provides a probabilistic framework for a classification problem. It has a simple and sound foundation for modeling the data and is quite robust to outliers and missing values. This algorithm is deployed widely in text mining and document classification, where the application has a large set of attributes and attribute values to compute. In our experience, the naïve Bayesian classifier is often a great place to start and is very workable as an initial model in the proof of concept (POC) stage of an analytics project. It also serves as a good benchmark for comparison to other models. Implementation of the Bayesian model in production systems is quite straightforward, and the use of data mining tools is optional. One major limitation of the model is the assumption of independent attributes, which can be mitigated by advanced modeling or by decreasing the dependence across attributes through preprocessing. The uniqueness of the technique is that it leverages new information as it arrives and tries to make the best prediction considering the new evidence. In this way, it is quite similar to how our minds work. Speaking of the mind, the next algorithm mimics the biological process of neurons in the human brain!