Naive Bayes Platform Overview
The Naive Bayes platform classifies observations into classes that are defined by the levels of a categorical response variable. The variables (or factors) that are used for classification are often called features in the data mining literature.
For each class, the naive Bayes algorithm computes the conditional probability of each feature value occurring. If a feature is continuous, its conditional marginal density is estimated. The naive Bayes technique assumes that, within a class, the features are independent. (This is the reason that the technique is referred to as “naive”.) Classification is based on the idea that an observation whose feature values have high conditional probabilities within a certain class has a high probability of belonging to that class. See Hastie et al. (2001).
Because the algorithm estimates only one-dimensional densities or distributions, the algorithm is extremely fast. This makes it suitable for large data sets, and in particular, data sets with large numbers of features. All nonmissing feature values for an observation are used in calculating the conditional probabilities.
Each observation is assigned a naive score for each class. An observation’s naive score for a given class is the proportion of training observations that belong to that class multiplied by the product of the observation’s conditional probabilities. The naive probability that an observation belongs to a class is its naive score for that class divided by the sum of its naive scores across all classes. The observation is assigned to the class for which it has the highest naive probability.
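The scoring and normalization steps described above can be sketched in Python for categorical features (toy data and function names are hypothetical; this is a simplified illustration, not JMP's implementation):

```python
# A minimal sketch of naive Bayes scoring with categorical features.
def fit_naive_bayes(rows, labels):
    """Estimate class priors and per-class feature-value proportions."""
    n = len(labels)
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / n for c in classes}
    cond = {}  # cond[(class, feature_index, value)] = proportion within class
    for c in classes:
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        for j in range(len(rows[0])):
            for v in set(r[j] for r in rows):
                count = sum(1 for r in class_rows if r[j] == v)
                cond[(c, j, v)] = count / len(class_rows)
    return classes, priors, cond

def naive_probs(x, classes, priors, cond):
    """Naive score per class, normalized to naive probabilities."""
    scores = {}
    for c in classes:
        score = priors[c]  # proportion of training observations in class c
        for j, v in enumerate(x):
            if v is not None:  # only nonmissing feature values are used
                score *= cond.get((c, j, v), 0.0)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

rows = [("hi", "a"), ("hi", "b"), ("lo", "a"), ("lo", "a")]
labels = ["High", "High", "Low", "Low"]
model = fit_naive_bayes(rows, labels)
probs = naive_probs(("hi", "a"), *model)
predicted = max(probs, key=probs.get)
```

The observation is then assigned to the class with the highest naive probability, here `predicted`.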
Caution: Because the features are assumed to be independent within each class, the naive Bayes estimated probabilities can be inaccurate when this assumption is violated.
Naive Bayes requires a large number of training observations to ensure that all predictor values and classes are represented. If a new observation being classified has a categorical predictor value that did not appear in the Training set, the platform uses the nonmissing features to predict. However, if you save a prediction formula, that formula does not handle missing values.
For more information about the naive Bayes technique, see Hand et al. (2016) and Shmueli et al. (2010).
Example of Naive Bayes
You have baseline medical data for 442 diabetic patients. You also have a binary measure of diabetes disease progression obtained one year after each patient’s initial visit. This measure quantifies disease progression as being either Low or High. You want to construct a classification model to be used in predicting the disease progression for future patients as High or Low.
1. Select Help > Sample Data Library and open Diabetes.jmp.
2. Select Analyze > Predictive Modeling > Naive Bayes.
3. Select Y Binary and click Y, Response.
4. Select Age through Glucose and click X, Factor.
5. Select Validation and click Validation.
6. Click OK.
Figure 9.2 Naive Bayes Report
The Training Set has about a 21% misclassification rate and the Validation Set has about a 24% misclassification rate. The Confusion matrix suggests that, for both the Training and Validation sets, the larger source of misclassification comes from classifying patients with Low disease progression as having High disease progression. The Validation set results indicate how your model extends to independent observations.
You are interested in which individual predictors have the greatest impact on the naive Bayes classification.
7. Click the red triangle next to Naive Bayes and select Profiler.
Figure 9.3 Prediction Profiler for Disease Progression
8. Click the red triangle next to Prediction Profiler and select Assess Variable Importance > Independent Uniform Inputs.
Figure 9.4 Variable Importance
The Summary Report indicates that HDL, BMI, and LTG have the greatest impact on the estimated probabilities.
Figure 9.5 Marginal Model Plots Report
The second row of plots in the Marginal Model Plots report shows that higher values of HDL are associated with a lower probability of classifying a patient as High. Also, higher BMI and LTG values are associated with a higher probability of classifying a patient as High.
Launch the Naive Bayes Platform
Launch the Naive Bayes platform by selecting Analyze > Predictive Modeling > Naive Bayes.
Figure 9.6 Naive Bayes Launch Window
The Naive Bayes launch window provides the following options:
Y, Response
The categorical response column whose values are the classes of interest.
X, Factor
Categorical or continuous predictor columns.
Weight
A column whose numeric values assign a weight to each row in the analysis.
Freq
A column whose numeric values assign a frequency to each row in the analysis.
Validation
A numeric column that contains at most three distinct values. See “Validation” in the “Partition Models” chapter.
Note: If neither a Validation column nor a Validation Portion is specified in the launch window and there are excluded rows, these rows are treated as a Validation set.
By
A column or columns whose levels define separate analyses. For each level of the specified column, the corresponding rows are analyzed using the other variables that you have specified. The results are presented in separate reports. If more than one By variable is assigned, a separate analysis is produced for each possible combination of the levels of the By variables.
Validation Portion
The portion of the data to be used as the Validation set. See “Validation” in the “Partition Models” chapter.
The Naive Bayes Report
After you click OK in the launch window, the Naive Bayes report appears. By default, the Naive Bayes report contains a report for the response column and a Confusion Matrix report.
Response Column Report
The response column report shows performance statistics for the naive Bayes classification in a summary table for the Training set, and the Validation and Test sets if they are specified. The summary tables contain the following columns:
Count
Number of observations in the set corresponding to the table (Training, Validation, or Test set).
Misclassification Rate
Proportion of observations in the corresponding set that are misclassified by the model. This is calculated as Misclassifications divided by Count.
Misclassifications
Number of observations in the corresponding set that are classified incorrectly.
Confusion Matrix Report
The Confusion Matrix report shows a confusion matrix for the Training set, and for the Validation and Test sets if they are specified. A confusion matrix is a two-way classification of actual and predicted responses.
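The tabulation behind such a two-way classification can be sketched in a few lines of Python (toy labels, not JMP output):

```python
# Cross-tabulate actual vs. predicted classes into a confusion matrix.
actual    = ["High", "High", "Low", "Low", "Low", "High"]
predicted = ["High", "Low",  "Low", "High", "Low", "High"]

classes = sorted(set(actual))
matrix = {a: {p: 0 for p in classes} for a in classes}  # rows = actual
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

# The summary statistics from the response column report:
misclassifications = sum(1 for a, p in zip(actual, predicted) if a != p)
rate = misclassifications / len(actual)
```

The diagonal entries of `matrix` count correct classifications; the off-diagonal entries count the misclassifications summarized in the response column report.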
Naive Bayes Platform Options
The Naive Bayes red triangle menu contains the following options:
Save Predicteds
Saves the predicted classifications to the data table in a column called Naive Predicted <Y, Response>.
Save Prediction Formula
Saves a column called Naive Predicted Formula <Y, Response> to the data table. This column contains the prediction formula for the classifications.
Save Probability Formula
Saves columns to the data table that contain formulas used for classifying each observation. Three groups of columns are saved:
Naive Score <Class>, Naive Score Sum
For each class, the Naive Score column gives a score formula that measures strength of membership in the given class. In the Naive Score Sum column, these scores are summed across all classes. See “Saved Probability Formulas”.
Naive Prob <Class>
For each class, this column gives a formula for the conditional probability that an observation is in that class. See “Saved Probability Formulas”.
Naive Predicted Formula <Y, Response>
Gives the formula for the predicted class.
Publish Probability Formula
Creates probability formulas and saves them as formula column scripts in the Formula Depot platform. If a Formula Depot report is not open, this option opens the Formula Depot window. See the Formula Depot chapter in the Predictive and Specialized Modeling book.
Profiler
Shows or hides an interactive profiler report. Changes in the factor values are reflected in the estimated classification probabilities. See the Profiler chapter in the Profilers book for more information.
See the JMP Reports chapter in the Using JMP book for more information about the following options:
Local Data Filter
Shows or hides the local data filter that enables you to filter the data used in a specific report.
Redo
Contains options that enable you to repeat or relaunch the analysis. In platforms that support the feature, the Automatic Recalc option immediately reflects the changes that you make to the data table in the corresponding report window.
Save Script
Contains options that enable you to save a script that reproduces the report to several destinations.
Save By-Group Script
Contains options that enable you to save a script that reproduces the platform report for all levels of a By variable to several destinations. Available only when a By variable is specified in the launch window.
Additional Example of Naive Bayes
You have historical financial data for 5,960 customers who applied for home equity loans. Each customer was classified as being a Good Risk or Bad Risk. There is missing data on most of the predictors. You want to construct a model to use in classifying the credit risk of future customers.
1. Select Help > Sample Data Library and open Equity.jmp.
2. Select Analyze > Predictive Modeling > Naive Bayes.
3. Select BAD and click Y, Response.
One of the potential predictors, DEBTINC, has many missing values that might be informative. However, naive Bayes does not handle large numbers of missing values well, so you do not include DEBTINC in your model.
4. Select LOAN through CLNO and click X, Factor.
5. Select Validation and click Validation.
6. Click OK.
Figure 9.7 Naive Bayes Report for BAD
The Training, Validation, and Test sets show misclassification rates between 18% and 19%. The confusion matrices for all of the sets suggest that the largest source of misclassification is the classification of Bad Risk customers as Good Risk customers.
You are interested in the probabilities that customers with certain financial background values are classified as High Risk.
7. Click the red triangle next to Naive Bayes and select Save Probability Formula.
Three sets of columns are added to the data table. Notice that observations with any missing predictor values have missing values in the new columns.
The three Naive Score columns contain naive score formulas for Good Risk, Bad Risk, and the sum of both.
The two Naive Prob columns contain probability formulas for Good Risk and Bad Risk.
The Naive Predicted Formula Bad column contains a formula that assigns an observation to the class for which the observation has the highest naive probability.
Use these formulas to score new customers. For details about the formula columns, see “Saved Probability Formulas”.
Statistical Details for the Naive Bayes Platform
Algorithm
The naive Bayes method classifies an observation into the class for which its probability of membership, given the values of its features, is highest. The method assumes that the features are conditionally independent within each class.
Denote the possible classifications by C1, …, Ck. Denote the features, or predictors, by X1, X2, …, Xp.
Within each class, the conditional probability or density of each feature value is estimated as follows:
If Xj is categorical, P(xj | Cr) is estimated by the proportion of training observations in the class Cr for which Xj = xj.
If Xj is continuous, the conditional density is estimated as follows:
f(xj | Cr) = (1/s)φ((xj − m)/s)
Here, φ is the standard normal density function, and m and s are the mean and standard deviation, respectively, of the predictor values within the class Cr.
The conditional probability that an observation with predictor values x1, x2, …, xp belongs in the class Cr is computed as follows:
P(Cr | x1, …, xp) = P(Cr)P(x1 | Cr)…P(xp | Cr) / [Σs P(Cs)P(x1 | Cs)…P(xp | Cs)]
Here, P(Cr) is the proportion of training observations that belong to the class Cr, and the sum in the denominator is taken over all classes C1, …, Ck.
An observation is classified into the class for which its conditional probability is the largest.
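This computation can be sketched for continuous features as follows (the priors and the per-class means and standard deviations below are hypothetical values, not estimates from any JMP data set):

```python
import math

def normal_density(x, m, s):
    """(1/s) * phi((x - m)/s): the within-class density estimate for a
    continuous feature, where phi is the standard normal density."""
    z = (x - m) / s
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

def class_probabilities(x, priors, params):
    """params[c] holds one (mean, sd) pair per feature for class c."""
    scores = {}
    for c, prior in priors.items():
        score = prior  # P(Cr)
        for xj, (m, s) in zip(x, params[c]):
            score *= normal_density(xj, m, s)  # times f(xj | Cr)
        scores[c] = score
    total = sum(scores.values())  # denominator: sum over all classes
    return {c: v / total for c, v in scores.items()}

# Hypothetical one-feature example with two classes:
priors = {"High": 0.5, "Low": 0.5}
params = {"High": [(10.0, 2.0)], "Low": [(4.0, 2.0)]}
probs = class_probabilities([9.0], priors, params)
```

The observation at x = 9.0 lies much closer to the hypothetical High-class mean, so its normalized probability for High dominates.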
Saved Probability Formulas
This section describes the formulas saved using the Save Probability Formula option. The conditional probability that an observation with predictor values x1, x2, …, xp belongs in the class Cr is denoted P(Cr | x1, …, xp) and is defined in the section “Algorithm”.
Naive Score Formulas
The Naive Score formula for a given class Cr is the numerator P(Cr)P(x1 | Cr)…P(xp | Cr) in the expression for P(Cr | x1, …, xp).
The Naive Score Sum formula sums these scores across all classes. This sum is the denominator in the expression for P(Cr | x1, …, xp).
Naive Prob Formulas
The Naive Prob formula for a given class Cr equals P(Cr | x1, …, xp), which is the Naive Score for Cr divided by the Naive Score Sum.
Naive Predicted Formula
The Naive Predicted Formula for an observation classifies that observation into the class for which P(Cr | x1, …, xp) is the largest. This is equivalent to classifying the observation into the class for which its Naive Score formula is the largest, because every class shares the same Naive Score Sum denominator.
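Because the Naive Score Sum is identical for every class, dividing by it cannot change which class is largest. A quick check of this equivalence (the score values below are arbitrary):

```python
# Arbitrary naive scores for three hypothetical classes.
scores = {"High": 0.12, "Low": 0.03, "Medium": 0.05}

# Normalizing by the shared Naive Score Sum gives the naive probabilities.
total = sum(scores.values())
probs = {c: s / total for c, s in scores.items()}

# The argmax is the same whether taken over scores or probabilities.
by_score = max(scores, key=scores.get)
by_prob = max(probs, key=probs.get)
```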