The data mining algorithms used for classification tasks are quite diverse. The objective of these algorithms is the same—prediction of a target variable. But, the method of predicting is drawn from a range of multidisciplinary techniques. The naïve Bayes algorithm finds its roots in statistics and probability theory. In general, classification techniques try to predict class labels based on attributes by best approximating the relationship between input and output variables. Every day, we mentally estimate a myriad of outcomes based on past evidence. Consider the process of guessing commuting time to work. First, commute time depends heavily on when you are traveling. If you are traveling during peak hours, most likely the commute is going to be longer. Weather conditions like rain, snow, or dense fog will slow down the commute. If the day is a school holiday, like summer break, then the commute will be lighter than on school days. If there is any planned road work ahead, the commute usually takes longer. When more than one adverse factor is at
play, then the commute will be even longer than if it is just one isolated factor. All this if-then knowledge is based on previous experience of commuting when one or more factors come into play. Our experience creates a model in our brain and we mentally run the model before pulling out of the driveway!
Let’s take the case of defaults in home mortgages and assume the average overall default rate is 2%. The likelihood of an average person defaulting on their mortgage loan is 2%. However, if a given individual’s credit history is above average (or excellent), then the likelihood of their default would be less than average. Furthermore, if we know that the person’s annual salary is above average with respect to loan value, then the likelihood of default falls further. As we obtain more evidence on the factors that impact the outcome, we can make improved guesses about the outcome using probability theory. The naïve Bayesian algorithm basically leverages the probabilistic relationship between the factors (attributes) and the class label (outcome). The algorithm makes a strong and sometimes naïve assumption of independence between the attributes, thus its name. The independence assumption between attributes may not always hold true. In some cases, we can assume annual income and credit score are independent of each other. However, in many cases we just don’t know. If one of the factors for the default rate is home value, then we have a scenario where both the annual income and home value factors are correlated and thus not independent. Homeowners with high income tend to buy more expensive houses. The independence assumption doesn’t hold true always, but the simplicity and robustness of the algorithm offsets the limitation introduced by the independence assumption.
Predicting and Filtering Spam Email
Spam is unsolicited bulk email sent to a wide number of email users. At best it is an annoyance to recipients, but many spam emails hide a malicious intent by hosting false advertisements or redirecting clicks to phishing sites. Filtering spam email is one of the essential features provided by email service providers and administrators. The key challenge is the balance between incorrectly flagging a legitimate email as spam (false positive) and not catching all the spam messages. There is no perfect spam filtering solution, and spam detection is a catch-up game: spammers constantly try to deceive and outsmart the spam filters, and email administrators fortify the filters against new spam scenarios. Automated spam filtering based on algorithms provides a promising solution for containing spam and a learning framework for updating the filtering solutions (Process Software, 2013).
Some words occur in spam emails more often than in legitimate email messages. For example, the probability of occurrence of words like free, mortgage, credit, sale, Viagra, etc. is higher in spam emails than in normal emails. We can calculate the exact probabilities if we have a sample of previously known spam emails and regular emails. Based on the known word probabilities, we can compute the overall probability of an email being spam from all the words in the email and the probability of each word appearing in spam versus regular emails. This is the foundation of Bayesian spam filtering systems (Zdziarski, 2005). Any wrongly classified spam message that is subsequently reclassified by the user is an opportunity to refine the model, making spam filtering adaptive to new spam techniques. Though recent spam reduction uses a combination of different algorithms, Bayesian-based spam filtering remains one of the foundational elements of spam prediction systems (Sahami et al., 1998).
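To make this concrete, here is a toy sketch of word-based Bayesian spam scoring. All of the word probabilities and the prior spam rate below are made-up values for illustration only; a real filter would estimate them from corpora of known spam and regular email.

```python
# A toy sketch of Bayesian spam scoring. All word probabilities and the
# prior spam rate are hypothetical values chosen for illustration;
# a real filter would estimate them from corpora of labeled mail.
p_word_spam = {"free": 0.30, "mortgage": 0.20, "meeting": 0.02}  # P(word|spam)
p_word_ham  = {"free": 0.03, "mortgage": 0.01, "meeting": 0.15}  # P(word|ham)
p_spam, p_ham = 0.40, 0.60                                       # assumed priors

def spam_probability(words):
    """Posterior P(spam | words) under the naive independence assumption."""
    score_spam, score_ham = p_spam, p_ham
    for w in words:
        if w in p_word_spam:            # unknown words are simply skipped
            score_spam *= p_word_spam[w]
            score_ham *= p_word_ham[w]
    return score_spam / (score_spam + score_ham)

print(round(spam_probability(["free", "mortgage"]), 3))  # high spam likelihood
print(round(spam_probability(["meeting"]), 3))           # low spam likelihood
```

An email containing the "spammy" words scores near 1, while one containing office vocabulary scores near 0; reclassified messages would simply update the word probability tables.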
4.4.1. How it Works
The naïve Bayesian algorithm is built on Bayes’ theorem, named after Reverend Thomas Bayes. Bayes’ work is described in “Essay Towards Solving a Problem in the Doctrine of Chances” (1763), published posthumously in the Philosophical Transactions of the Royal Society of London by Richard Price. Bayes’ theorem is one of the most influential and important concepts in statistics and probability theory. It provides a mathematical expression for how a degree of subjective belief changes to account for new evidence. First, let’s discuss the terminology used in Bayes’ theorem.
Assume X is the evidence (or factors or attribute set) and Y is the outcome (or target or label class). Here X is a set, not individual attributes, hence X = {X1, X2, X3, …, Xn}, where Xi is an individual attribute, such as credit rating. The probability of outcome P(Y) is called prior probability, which can be calculated from the data set. Prior probability shows the likelihood of an outcome in a given data set. For example, in the mortgage case, P(Y) is the default rate of a home mortgage, which is 2%. P(Y|X) is called the conditional probability, which provides the probability of an outcome given the evidence when we know the value of X. Again, using the mortgage example, P(Y|X) is the average rate of default given that an individual’s credit history is known. If the credit history is excellent, then the probability of default is likely to be less than 2%. P(Y|X) is also called posterior probability. Calculating posterior probability is the objective of predictive analytics using Bayes’ theorem. This is the likelihood of an outcome as we learn the values of the input attributes.
Bayes’ theorem states that

P(Y|X) = P(Y) ∗ P(X|Y) / P(X)    (4.13)

P(X|Y) is another conditional probability, called the class conditional probability. P(X|Y) is the probability that an attribute assumes a particular value given the class label. Like P(Y), P(X|Y) can be calculated from the data set as well. If we know the training set of loan defaults, we can calculate the probability of an “excellent” credit rating given that the default is a “yes.” As indicated in Bayes’ theorem, class conditional probability is crucial in calculating posterior probability. P(X) is basically the probability of the evidence. In the mortgage example, this is simply the proportion of individuals with a given credit rating. To classify a new record, we can compute P(Y|X) for each class of Y and see which probability “wins.” The class label Y with the highest value of P(Y|X) wins for a particular attribute value X. Since P(X) is the same for every class value of the outcome, we don’t have to calculate it and can treat it as a constant. More generally, in an example set with n attributes X = {X1, X2, X3, …, Xn},
P(Y|X) = P(Y) ∗ P(X1|Y) ∗ P(X2|Y) ∗ … ∗ P(Xn|Y) / P(X)    (4.14)

If we know how to calculate the class conditional probability P(X|Y), or equivalently the product P(X1|Y) ∗ P(X2|Y) ∗ … ∗ P(Xn|Y), then it is easy to calculate the posterior probability P(Y|X). Since P(X) is constant for every value of Y, it is enough to calculate the numerator P(Y) ∗ P(X1|Y) ∗ … ∗ P(Xn|Y) of the equation for every class value.
To further explain how the naïve Bayesian algorithm works, let’s use the modified Golf data set shown in Table 4.4. The Golf table is an artificial data set with four attributes and one class label. Note that we are using the nominal data table for easier explanation (Temperature and Humidity have been converted from the numeric type). In Bayesian terms, the weather conditions are the evidence and the decision to play or not play is the belief. Altogether there are 14 examples, with five examples of Play = no and nine examples of Play = yes. The objective is to predict whether the player will Play (yes or no), given the information about a few weather-related measures, based on learning from the data set in Table 4.4. Here is a step-by-step explanation of how the Bayesian model works.
Step 1: Calculating Prior Probability P(Y)
Prior probability P(Y) is the probability of an outcome. In this example set there are two possible outcomes: Play = yes and Play = no. From Table 4.4, out of 14 records there are 5 records with the “no” class and 9 records with the “yes” class. The prior probabilities of the outcomes are therefore P(Y = no) = 5/14 = 0.36 and P(Y = yes) = 9/14 = 0.64.
Table 4.4
Golf Data Set with Modified Temperature and Humidity Attributes
No. | Temperature X1 | Humidity X2 | Outlook X3 | Wind X4 | Play (Class Label) Y |
1 | high | med | sunny | false | no |
2 | high | high | sunny | true | no |
3 | low | low | rain | true | no |
4 | med | high | sunny | false | no |
5 | low | med | rain | true | no |
6 | high | med | overcast | false | yes |
7 | low | high | rain | false | yes |
8 | low | med | rain | false | yes |
9 | low | low | overcast | true | yes |
10 | low | low | sunny | false | yes |
11 | med | med | rain | false | yes |
12 | med | low | sunny | true | yes |
13 | med | high | overcast | true | yes |
14 | high | low | overcast | false | yes |
Since the probability of an outcome is calculated from the data set, it is important that the data set used for data mining is representative of the population when sampling is used. A class-stratified sample drawn from the population will not be compatible with naïve Bayesian modeling, because it distorts the class proportions and hence the prior probabilities.
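As a quick illustration, the prior probabilities for the Golf data set can be computed with a few lines of Python; the list of labels simply mirrors the Play column of Table 4.4.

```python
# Sketch: computing the prior probabilities P(Y) for the Golf data set of
# Table 4.4 (5 records with Play = no, 9 records with Play = yes).
from collections import Counter

play_labels = ["no"] * 5 + ["yes"] * 9          # class labels of the 14 records

counts = Counter(play_labels)
priors = {y: c / len(play_labels) for y, c in counts.items()}
print(priors)                                   # P(Y=no) = 5/14, P(Y=yes) = 9/14
```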
Step 2: Calculating Class Conditional Probability P(Xi|Y)
Class conditional probability is the probability of each attribute value for an attribute, for each outcome value. This calculation is repeated for all the attributes: Temperature (X1), Humidity (X2), Outlook (X3), and Wind (X4), and for every distinct outcome value. Let’s calculate the class conditional probability of Temperature (X1). For each value of the Temperature attribute, we can calculate P(X1|Y = no) and P(X1|Y = yes) by constructing a probability table, as shown in Table 4.5. From the data set there are five Y = no records and nine Y = yes records. Out of the five Y = no records, we can calculate the probability of occurrence when the temperature is high, medium, or low: the values are 2/5, 1/5, and 2/5, respectively. We can repeat the same process for the outcome Y = yes.
Similarly, we can repeat the calculation to find the class conditional probabilities for the other three attributes: Humidity (X2), Outlook (X3), and Wind (X4). These class conditional probabilities are shown in Table 4.6.
Table 4.5
Class Conditional Probability of Temperature
Temperature (X1) | P(X1|Y = no) | P(X1|Y = yes) |
high | 2/5 | 2/9 |
med | 1/5 | 3/9 |
low | 2/5 | 4/9 |
Table 4.6
Conditional Probability of Humidity, Outlook, and Wind
Humidity (X2) | P(X2|Y = no) | P(X2|Y = yes) |
high | 2/5 | 2/9 |
low | 1/5 | 4/9 |
med | 2/5 | 3/9 |
Outlook (X3) | P(X3|Y = no) | P(X3|Y = yes) |
overcast | 0/5 | 4/9 |
rain | 2/5 | 3/9 |
sunny | 3/5 | 2/9 |
Wind (X4) | P(X4|Y = no) | P(X4|Y = yes) |
false | 2/5 | 6/9 |
true | 3/5 | 3/9 |
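The class conditional tables above can be derived mechanically from raw counts. The following sketch does this for the Temperature attribute, using the (value, class) pairs read from Table 4.4.

```python
# Sketch: deriving the class conditional probabilities of Table 4.5
# from raw counts, shown here for the Temperature attribute of Table 4.4.
from collections import Counter

temperature = [  # (temperature, play) pairs from Table 4.4, records 1-14
    ("high", "no"),  ("high", "no"),  ("low", "no"),   ("med", "no"),
    ("low", "no"),   ("high", "yes"), ("low", "yes"),  ("low", "yes"),
    ("low", "yes"),  ("low", "yes"),  ("med", "yes"),  ("med", "yes"),
    ("med", "yes"),  ("high", "yes"),
]

class_counts = Counter(y for _, y in temperature)    # {"no": 5, "yes": 9}
joint_counts = Counter(temperature)                  # counts of (value, class)
cond = {(x, y): n / class_counts[y] for (x, y), n in joint_counts.items()}

print(cond[("high", "no")])   # 2/5, matching Table 4.5
print(cond[("low", "yes")])   # 4/9, matching Table 4.5
```

Repeating the same counting for Humidity, Outlook, and Wind reproduces Table 4.6.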
Table 4.7
Test Record
No. | Temperature X1 | Humidity X2 | Outlook X3 | Wind X4 | Play (Class Label) Y |
Unlabeled Test | high | low | sunny | false | ? |
Step 3: Predicting the Outcome Using Bayes’ Theorem
We are all set with the class conditional probability tables, and they can now be used in the prediction task. If a new, unlabeled test record (Table 4.7) has the attribute values Temperature = high, Humidity = low, Outlook = sunny, and Wind = false, what would be the class label prediction: Play = yes or no? The outcome class can be predicted based on Bayes’ theorem by calculating the posterior probability P(Y|X) for both values of Y. Once P(Y = yes|X) and P(Y = no|X) are calculated, the predicted outcome is the one with the higher probability. While calculating both posterior probabilities using Equation 4.14, it is sufficient to calculate just the numerator P(Y) ∗ P(X1|Y) ∗ … ∗ P(Xn|Y), as P(X) is going to be the same for both outcome classes.
P(Y = yes|X) = P(Y = yes) ∗ {P(Temp = high|Y = yes) ∗ P(Humidity = low|Y = yes) ∗ P(Outlook = sunny|Y = yes) ∗ P(Wind = false|Y = yes)}/P(X)
= 9/14 ∗ {2/9 ∗ 4/9 ∗ 2/9 ∗ 6/9}/P(X) = 0.0094/P(X)
P(Y = no|X) = P(Y = no) ∗ {P(Temp = high|Y = no) ∗ P(Humidity = low|Y = no) ∗ P(Outlook = sunny|Y = no) ∗ P(Wind = false|Y = no)}/P(X)
= 5/14 ∗ {2/5 ∗ 1/5 ∗ 3/5 ∗ 2/5}/P(X) = 0.0069/P(X)
All the class conditional probabilities, including P(Humidity = low|Y = no) = 1/5, are read off Tables 4.5 and 4.6. Normalizing both estimates by dividing by their sum (0.0094 + 0.0069) gives
Likelihood of (Play = yes) = 0.0094/(0.0094 + 0.0069) = 58%
Likelihood of (Play = no) = 0.0069/(0.0094 + 0.0069) = 42%
In this case P(Y = yes|X) > P(Y = no|X), hence the prediction for the unlabeled test record will be Play = yes.
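The whole three-step calculation can be sketched in a few lines of Python, working directly from the records of Table 4.4; the variable names (scores, likelihoods) are ours, not part of any standard library.

```python
# Sketch: scoring the unlabeled record of Table 4.7 directly from the records
# of Table 4.4. For each class, the numerator P(Y) * prod_i P(Xi|Y) of
# Equation 4.14 is computed, and the two scores are then normalized.
from collections import Counter

data = [  # (temperature, humidity, outlook, wind, play) rows of Table 4.4
    ("high", "med", "sunny", "false", "no"),
    ("high", "high", "sunny", "true", "no"),
    ("low", "low", "rain", "true", "no"),
    ("med", "high", "sunny", "false", "no"),
    ("low", "med", "rain", "true", "no"),
    ("high", "med", "overcast", "false", "yes"),
    ("low", "high", "rain", "false", "yes"),
    ("low", "med", "rain", "false", "yes"),
    ("low", "low", "overcast", "true", "yes"),
    ("low", "low", "sunny", "false", "yes"),
    ("med", "med", "rain", "false", "yes"),
    ("med", "low", "sunny", "true", "yes"),
    ("med", "high", "overcast", "true", "yes"),
    ("high", "low", "overcast", "false", "yes"),
]
test = ("high", "low", "sunny", "false")   # the unlabeled record of Table 4.7

class_counts = Counter(row[-1] for row in data)
scores = {}
for y, n_y in class_counts.items():
    score = n_y / len(data)                          # prior P(Y)
    for i, value in enumerate(test):                 # class conditionals P(Xi|Y)
        n_match = sum(1 for row in data if row[-1] == y and row[i] == value)
        score *= n_match / n_y
    scores[y] = score

total = sum(scores.values())
likelihoods = {y: s / total for y, s in scores.items()}
prediction = max(likelihoods, key=likelihoods.get)
print(likelihoods, prediction)
```

The class with the larger normalized likelihood is the prediction for the test record.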
Bayesian modeling is relatively simple to understand, once you get past the notation, and is easy to implement in practically any programming language. The computation for model building is quite simple and involves the creation of a lookup table of probabilities. Bayesian modeling is also quite robust in handling missing values: if a test record does not contain a value for an attribute (suppose temperature is missing), the Bayesian model simply omits the corresponding class conditional probability for each outcome. Missing values in the test set would be difficult to handle in decision trees and regression algorithms, particularly when the missing attribute is used high up in the decision tree or has more weight in the regression. Even though the naïve Bayes algorithm is quite robust to missing attributes, it does have a few limitations. Here are a couple of the most significant limitations, along with methods of mitigation.
Issue 1: Incomplete Training Set
Problems arise when an attribute value in a test record has no examples in the training set. In the Golf data set (Table 4.4), if a test record has the attribute value Outlook = overcast, the class conditional probability P(Outlook = overcast|Y = no) is zero, since none of the five Y = no records has Outlook = overcast. If even one of the attribute’s class conditional probabilities is zero, then, by the nature of the Bayesian equation, the entire posterior probability becomes zero.
P(Y = no|X) = P(Y = no) ∗ {P(Temp = high|Y = no) ∗ P(Humidity = low|Y = no) ∗ P(Outlook = overcast|Y = no) ∗ P(Wind = false|Y = no)}/P(X)
= 5/14 ∗ {2/5 ∗ 1/5 ∗ 0 ∗ 2/5}/P(X) = 0
In this case P(Y = yes|X) > P(Y = no|X), and the test record will be classified as Play = yes. If some other attribute value likewise has no training records, say Temperature = low for the outcome yes, then both posterior probabilities, P(Y = no|X) and P(Y = yes|X), will be zero and an arbitrary prediction would have to be made.
To mitigate this problem, we can assign a small default probability for unseen attribute values instead of zero. With this, the absence of an attribute value does not wipe out the value of P(X|Y), although it does reduce the probability to a small number. This technique is called Laplace correction. Laplace correction adds a controlled error to all the class conditional probabilities.
In the above data set, if a test record contains Outlook = overcast, then P(Outlook = overcast|Y = no) = 0. The class conditional probabilities for the three Outlook values, given Y = no, are 0/5, 2/5, and 3/5. We can add a controlled error by adding 1 to all numerators and 3 to all denominators, so the corrected class conditional probabilities are 1/8, 3/8, and 4/8. The sum of all the class conditional probabilities is still 1. Generically, the Laplace correction is given by
Corrected probability P(Xi|Y = no) = (0 + μp1)/(5 + μ), (2 + μp2)/(5 + μ), (3 + μp3)/(5 + μ)    (4.15)

where p1 + p2 + p3 = 1 and μ is the correction. With μ = 3 and p1 = p2 = p3 = 1/3, Equation 4.15 yields exactly the corrected probabilities 1/8, 3/8, and 4/8.
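In code, the simple add-one version of the Laplace correction used above looks like this; counts holds the Outlook frequencies for the Y = no records.

```python
# Sketch of the Laplace correction applied to the Outlook counts given
# Y = no (raw counts: overcast 0, rain 2, sunny 3, out of five records).
counts = {"overcast": 0, "rain": 2, "sunny": 3}
n = sum(counts.values())      # 5 records with Y = no
k = len(counts)               # 3 distinct Outlook values

# Add-one smoothing (mu = 3 and p1 = p2 = p3 = 1/3 in Equation 4.15):
corrected = {v: (c + 1) / (n + k) for v, c in counts.items()}
print(corrected)              # 1/8, 3/8, and 4/8; still sums to 1
```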
Issue 2: Continuous Attributes
If an attribute has continuous numeric values instead of nominal values, the solution discussed above will not work. We can always convert the continuous values to nominal values by discretization and use the same approach as discussed. But discretization requires exercising subjective judgment on the bucketing range, leading to loss of information. Instead, we can preserve the continuous values as such and use the probability density function. We assume the probability distribution for a numerical attribute follows a normal or Gaussian distribution. If the attribute value is known to follow some other distribution, such as Poisson, the equivalent probability density function can be used. The probability density function for a normal distribution is given by
f(x) = (1/(√(2π) σ)) e^(−(x − μ)²/(2σ²))    (4.16)

where μ is the mean and σ is the standard deviation of the sample.
In the updated Golf data set shown in Table 4.8, Temperature and Humidity are continuous attributes. In such a situation, we compute the mean and standard deviation of both attributes for each class label, Play = yes and Play = no (Table 4.9). If an unlabeled test record has a Humidity value of 78, we can compute the probability density using Equation 4.16 for both outcomes. For the outcome Play = yes, plugging the values x = 78, μ = 73, and σ = 6.16 into the probability density function renders the value 0.047. Similarly, for the outcome Play = no, we can plug in x = 78, μ = 74.6, and σ = 7.89 and compute the probability density to obtain 0.046:
Table 4.8
Golf Data Set with Continuous Attributes
No. | Outlook X1 | Humidity X2 | Temperature X3 | Wind X4 | Play Y |
1 | sunny | 85 | 85 | false | no |
2 | sunny | 80 | 90 | true | no |
6 | rain | 65 | 70 | true | no |
8 | sunny | 72 | 95 | false | no |
14 | rain | 71 | 80 | true | no |
3 | overcast | 83 | 78 | false | yes |
4 | rain | 70 | 96 | false | yes |
5 | rain | 68 | 80 | false | yes |
7 | overcast | 64 | 65 | true | yes |
9 | sunny | 69 | 70 | false | yes |
10 | rain | 75 | 80 | false | yes |
11 | sunny | 75 | 70 | true | yes |
12 | overcast | 72 | 90 | true | yes |
13 | overcast | 81 | 75 | false | yes |
Table 4.9
Mean and Deviation for Continuous Attributes
Play Value | | Humidity X2 | Temperature X3 |
Y = no | Mean | 74.60 | 84.00 |
| Deviation | 7.89 | 9.62 |
Y = yes | Mean | 73.00 | 78.22 |
| Deviation | 6.16 | 9.88 |
P(Humidity = 78|Y = yes) = 0.047
P(Humidity = 78|Y = no) = 0.046
The above values are probability densities and not probabilities. On a continuous scale, the probability of Humidity being exactly a particular value is zero. Instead, the probability is computed for a range, such as Humidity from 77.5 to 78.5 units. Since the same range is used for computing the probability density for both outcomes, Play = yes and Play = no, it is not necessary to compute the actual probability: we can substitute the density values directly into the Bayesian equation 4.14 when calculating the class conditional probability P(X|Y).
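The density calculation can be verified with a short sketch of Equation 4.16, using the Humidity parameters from Table 4.9.

```python
# Sketch: evaluating the normal probability density of Equation 4.16 with
# the Humidity parameters of Table 4.9, at the test value Humidity = 78.
import math

def normal_pdf(x, mu, sigma):
    """Normal (Gaussian) probability density at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

print(round(normal_pdf(78, 73.0, 6.16), 3))   # density given Play = yes
print(round(normal_pdf(78, 74.6, 7.89), 3))   # density given Play = no
```

These evaluate to roughly 0.047 and 0.046, the class conditional densities used above.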
Issue 3: Attribute Independence
One of the fundamental assumptions in the naïve Bayesian model is attribute independence: Bayes’ theorem is guaranteed only for independent attributes. In many real-life cases, this is quite a stringent condition, which is why the technique is called “naïve” Bayesian. However, in practice the naïve Bayesian model works fine even with slightly correlated features (Rish, 2001). We can handle this problem by preprocessing the data: before applying the naïve Bayesian algorithm, it makes sense to remove strongly correlated attributes. In the case of all-numeric attributes, this can be achieved by computing a correlation matrix and dropping one attribute from each strongly correlated pair. An advanced application of Bayes’ theorem, called a Bayesian belief network, is designed to handle data sets with attribute dependencies.
The independence of two categorical (nominal) attributes can be tested by the chi-square (χ²) test for independence. The chi-square statistic is calculated by creating a contingency table of observed frequencies, like the one shown in Table 4.10A. A contingency table is a simple cross tab of the two attributes under consideration.
Table 4.10
Contingency Tables with Observed Frequency (A) and Expected Frequency (B)
(A) Observed Frequency
Outlook | Wind = false | Wind = true | Total |
overcast | 2 | 2 | 4 |
rain | 3 | 2 | 5 |
sunny | 3 | 2 | 5 |
Total | 8 | 6 | 14 |
(B) Expected Frequency
Outlook | Wind = false | Wind = true | Total |
overcast | 2.29 | 1.71 | 4 |
rain | 2.86 | 2.14 | 5 |
sunny | 2.86 | 2.14 | 5 |
Total | 8 | 6 | 14 |
A contingency table of expected frequencies (Table 4.10B) can also be created based on the following equation:

E(r,c) = (row total ∗ column total) / (table total)    (4.17)
The chi-square statistic (χ²) sums the normalized squared differences between the two tables. χ² is calculated by Equation 4.18, in which O is the observed frequency and E is the expected frequency:

χ² = Σ (O − E)² / E    (4.18)
If the chi-square statistic (χ2) is less than the critical value calculated from the chi-square distribution for a given confidence level, then we can assume the two variables under consideration are independent, for practical purposes. This entire test can be performed in statistical tools or in Microsoft Excel.
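As a final sketch, the chi-square statistic for Outlook versus Wind can be computed directly from the observed frequencies of Table 4.10A, without any statistical package.

```python
# Sketch: the chi-square test of independence for Outlook and Wind, using
# the observed frequencies of Table 4.10A.
observed = {
    ("overcast", "false"): 2, ("overcast", "true"): 2,
    ("rain", "false"): 3,     ("rain", "true"): 2,
    ("sunny", "false"): 3,    ("sunny", "true"): 2,
}

row_totals, col_totals = {}, {}
for (r, c), o in observed.items():
    row_totals[r] = row_totals.get(r, 0) + o
    col_totals[c] = col_totals.get(c, 0) + o
n = sum(observed.values())                       # table total = 14

# Expected frequencies via Equation 4.17, chi-square via Equation 4.18
chi2 = 0.0
for (r, c), o in observed.items():
    expected = row_totals[r] * col_totals[c] / n
    chi2 += (o - expected) ** 2 / expected

print(round(chi2, 3))
```

The statistic comes out to about 0.117, well below the critical value of 5.99 for 2 degrees of freedom at the 95% confidence level, so Outlook and Wind can be treated as independent for practical purposes. The same result can be obtained with scipy.stats.chi2_contingency.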