Naive Bayes Platform Overview
The Naive Bayes platform classifies observations into classes that are defined by the levels of a categorical response variable. The variables (or factors) that are used for classification are often called features in the data mining literature.
For each class, the naive Bayes algorithm computes the conditional probability of each feature value occurring. If a feature is continuous, its conditional marginal density is estimated. The naive Bayes technique assumes that, within a class, the features are independent. (This is the reason that the technique is referred to as “naive”.) Classification is based on the idea that an observation whose feature values have high conditional probabilities within a certain class has a high probability of belonging to that class. See Hastie et al. (2001).
Because the algorithm estimates only one-dimensional densities or distributions, the algorithm is extremely fast. This makes it suitable for large data sets, and in particular, data sets with large numbers of features. All nonmissing feature values for an observation are used in calculating the conditional probabilities.
Each observation is assigned a naive score for each class. An observation’s naive score for a given class is the proportion of training observations that belong to that class multiplied by the product of the observation’s conditional probabilities. The naive probability that an observation belongs to a class is its naive score for that class divided by the sum of its naive scores across all classes. The observation is assigned to the class for which it has the highest naive probability.
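The scoring and normalization steps described above can be sketched in Python for categorical features (toy data and function names are hypothetical; this is a simplified illustration, not JMP's implementation):

```python
# A minimal sketch of naive Bayes scoring with categorical features.
def fit_naive_bayes(rows, labels):
    """Estimate class priors and per-class feature-value proportions."""
    n = len(labels)
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / n for c in classes}
    cond = {}  # cond[(class, feature_index, value)] = proportion within class
    for c in classes:
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        for j in range(len(rows[0])):
            for v in set(r[j] for r in rows):
                count = sum(1 for r in class_rows if r[j] == v)
                cond[(c, j, v)] = count / len(class_rows)
    return classes, priors, cond

def naive_probs(x, classes, priors, cond):
    """Naive score per class, normalized to naive probabilities."""
    scores = {}
    for c in classes:
        score = priors[c]  # proportion of training observations in class c
        for j, v in enumerate(x):
            if v is not None:  # only nonmissing feature values are used
                score *= cond.get((c, j, v), 0.0)
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

rows = [("hi", "a"), ("hi", "b"), ("lo", "a"), ("lo", "a")]
labels = ["High", "High", "Low", "Low"]
model = fit_naive_bayes(rows, labels)
probs = naive_probs(("hi", "a"), *model)
predicted = max(probs, key=probs.get)
```

The observation is then assigned to the class with the highest naive probability, here `predicted`.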
Caution: Because the features are assumed to be independent within each class, the naive Bayes estimated probabilities can be inaccurate when this assumption is violated.
Naive Bayes requires a large number of training observations to ensure that all predictor values and classes are represented. If a new observation being classified has a categorical predictor value that did not appear in the Training set, the platform uses the nonmissing features to predict. However, if you save a prediction formula, that formula does not handle missing values.
For more information about the naive Bayes technique, see Hand et al. (2016) and Shmueli et al. (2010).
Example of Naive Bayes
You have baseline medical data for 442 diabetic patients. You also have a binary measure of diabetes disease progression obtained one year after each patient’s initial visit. This measure quantifies disease progression as being either Low or High. You want to construct a classification model to be used in predicting the disease progression for future patients as High or Low.
1. Select Help > Sample Data Library and open Diabetes.jmp.
2. Select Analyze > Predictive Modeling > Naive Bayes.
3. Select Y Binary and click Y, Response.
4. Select Age through Glucose and click X, Factor.
5. Select Validation and click Validation.
6. Click OK.
Figure 9.2 Naive Bayes Report
The Training Set has about a 21% misclassification rate and the Validation Set has about a 24% misclassification rate. The Confusion matrix suggests that, for both the Training and Validation sets, the larger source of misclassification comes from classifying patients with Low disease progression as having High disease progression. The Validation set results indicate how your model extends to independent observations.
You are interested in which individual predictors have the greatest impact on the naive Bayes classification.
7. Click the red triangle next to Naive Bayes and select Profiler.
Figure 9.3 Prediction Profiler for Disease Progression
8. Click the red triangle next to Prediction Profiler and select Assess Variable Importance > Independent Uniform Inputs.
Figure 9.4 Variable Importance
The Summary Report indicates that HDL, BMI, and LTG have the greatest impact on the estimated probabilities.
Figure 9.5 Marginal Model Plots Report
The second row of plots in the Marginal Model Plots report shows that higher values of HDL are associated with a lower probability of classifying a patient as High. Also, higher BMI and LTG values are associated with a higher probability of classifying a patient as High.
Launch the Naive Bayes Platform
Launch the Naive Bayes platform by selecting Analyze > Predictive Modeling > Naive Bayes.
Figure 9.6 Naive Bayes Launch Window
The Naive Bayes launch window provides the following options:
Y, Response
The categorical response column whose values are the classes of interest.
X, Factor
Categorical or continuous predictor columns.
Weight
A column whose numeric values assign a weight to each row in the analysis.
Freq
A column whose numeric values assign a frequency to each row in the analysis.
Validation
A numeric column that contains at most three distinct values. See “Validation” in the “Partition Models” chapter.
Note: If neither a Validation column nor a Validation Portion is specified in the launch window and there are excluded rows, these rows are treated as a Validation set.
By
A column or columns whose levels define separate analyses. For each level of the specified column, the corresponding rows are analyzed using the other variables that you have specified. The results are presented in separate reports. If more than one By variable is assigned, a separate analysis is produced for each possible combination of the levels of the By variables.
Validation Portion
The portion of the data to be used as the Validation set. See “Validation” in the “Partition Models” chapter.
The Naive Bayes Report
After you click OK in the launch window, the Naive Bayes report appears. By default, the Naive Bayes report contains a report for the response column and a Confusion Matrix report.
Response Column Report
The response column report shows performance statistics for the naive Bayes classification in a summary table for the Training set, and the Validation and Test sets if they are specified. The summary tables contain the following columns:
Count
Number of observations in the set corresponding to the table (Training, Validation, or Test set).
Misclassification Rate
Proportion of observations in the corresponding set that are misclassified by the model. This is calculated as Misclassifications divided by Count.
Misclassifications
Number of observations in the corresponding set that are classified incorrectly.
Confusion Matrix Report
The Confusion Matrix report shows a confusion matrix for the Training set, and for the Validation and Test sets if they are specified. A confusion matrix is a two-way classification of actual and predicted responses.
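The tabulation behind such a two-way classification can be sketched in a few lines of Python (toy labels, not JMP output):

```python
# Cross-tabulate actual vs. predicted classes into a confusion matrix.
actual    = ["High", "High", "Low", "Low", "Low", "High"]
predicted = ["High", "Low",  "Low", "High", "Low", "High"]

classes = sorted(set(actual))
matrix = {a: {p: 0 for p in classes} for a in classes}  # rows = actual
for a, p in zip(actual, predicted):
    matrix[a][p] += 1

# The summary statistics from the response column report:
misclassifications = sum(1 for a, p in zip(actual, predicted) if a != p)
rate = misclassifications / len(actual)
```

The diagonal entries of `matrix` count correct classifications; the off-diagonal entries count the misclassifications summarized in the response column report.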
Naive Bayes Platform Options
The Naive Bayes red triangle menu contains the following options:
Save Predicteds
Saves the predicted classifications to the data table in a column called Naive Predicted <Y, Response>.
Save Prediction Formula
Saves a column called Naive Predicted Formula <Y, Response> to the data table. This column contains the prediction formula for the classifications.
Save Probability Formula
Saves columns to the data table that contain formulas used for classifying each observation. Three groups of columns are saved:
Naive Score <Class>, Naive Score Sum
For each class, the Naive Score column gives a score formula that measures strength of membership in the given class. In the Naive Score Sum column, these scores are summed across all classes. See “Saved Probability Formulas”.
Naive Prob <Class>
For each class, this column gives a formula for the conditional probability that an observation is in that class. See “Saved Probability Formulas”.
Naive Predicted Formula <Y, Response>
Gives the formula for the predicted class.
Publish Probability Formula
Creates probability formulas and saves them as formula column scripts in the Formula Depot platform. If a Formula Depot report is not open, this option opens the Formula Depot window. See the Formula Depot chapter in the Predictive and Specialized Modeling book.
Profiler
Shows or hides an interactive profiler report. Changes in the factor values are reflected in the estimated classification probabilities. See the Profiler chapter in the Profilers book for more information.
See the JMP Reports chapter in the Using JMP book for more information about the following options:
Local Data Filter
Shows or hides the local data filter that enables you to filter the data used in a specific report.
Redo
Contains options that enable you to repeat or relaunch the analysis. In platforms that support the feature, the Automatic Recalc option immediately reflects the changes that you make to the data table in the corresponding report window.
Save Script
Contains options that enable you to save a script that reproduces the report to several destinations.
Save By-Group Script
Contains options that enable you to save a script that reproduces the platform report for all levels of a By variable to several destinations. Available only when a By variable is specified in the launch window.
Additional Example of Naive Bayes
You have historical financial data for 5,960 customers who applied for home equity loans. Each customer was classified as being a Good Risk or Bad Risk. There is missing data on most of the predictors. You want to construct a model to use in classifying the credit risk of future customers.
1. Select Help > Sample Data Library and open Equity.jmp.
2. Select Analyze > Predictive Modeling > Naive Bayes.
3. Select BAD and click Y, Response.
One of the potential predictors, DEBTINC, has many missing values that might be informative. However, naive Bayes does not handle large numbers of missing values well, so you do not include DEBTINC in your model.
4. Select LOAN through CLNO and click X, Factor.
5. Select Validation and click Validation.
6. Click OK.
Figure 9.7 Naive Bayes Report for BAD
The Training, Validation, and Test sets show misclassification rates between 18% and 19%. The confusion matrices for all of the sets suggest that the largest source of misclassification is the classification of Bad Risk customers as Good Risk customers.
You are interested in the probabilities that customers with certain financial background values are classified as High Risk.
7. Click the red triangle next to Naive Bayes and select Save Probability Formula.
Three sets of columns are added to the data table. Notice that observations with any missing predictor values have missing values in the new columns.
The three Naive Score columns contain naive score formulas for Good Risk, Bad Risk, and the sum of both.
The two Naive Prob columns contain probability formulas for Good Risk and Bad Risk.
The Naive Predicted Formula Bad column contains a formula that assigns an observation to the class for which the observation has the highest naive probability.
Use these formulas to score new customers. For details about the formula columns, see “Saved Probability Formulas”.
Statistical Details for the Naive Bayes Platform
Algorithm
The naive Bayes method classifies an observation into the class for which its probability of membership, given the values of its features, is highest. The method assumes that the features are conditionally independent within each class.
Denote the possible classifications by C1, …, Ck. Denote the features, or predictors, by X1, X2, …, Xp.
Within each class, the conditional probability or density of each feature value is estimated as follows:
If Xj is categorical, P(xj | Cr) is estimated by the proportion of training observations in the class Cr for which Xj = xj.
If Xj is continuous, the conditional density is estimated as follows:
f(xj | Cr) = (1/s)φ((xj − m)/s)
Here, φ is the standard normal density function, and m and s are the mean and standard deviation, respectively, of the predictor values within the class Cr.
The conditional probability that an observation with predictor values x1, x2, …, xp belongs in the class Cr is computed as follows:
P(Cr | x1, …, xp) = P(Cr)P(x1 | Cr)…P(xp | Cr) / [Σs P(Cs)P(x1 | Cs)…P(xp | Cs)]
Here, P(Cr) is the proportion of training observations that belong to the class Cr, and the sum in the denominator is taken over all classes C1, …, Ck.
An observation is classified into the class for which its conditional probability is the largest.
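This computation can be sketched for continuous features as follows (the priors and the per-class means and standard deviations below are hypothetical values, not estimates from any JMP data set):

```python
import math

def normal_density(x, m, s):
    """(1/s) * phi((x - m)/s): the within-class density estimate for a
    continuous feature, where phi is the standard normal density."""
    z = (x - m) / s
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

def class_probabilities(x, priors, params):
    """params[c] holds one (mean, sd) pair per feature for class c."""
    scores = {}
    for c, prior in priors.items():
        score = prior  # P(Cr)
        for xj, (m, s) in zip(x, params[c]):
            score *= normal_density(xj, m, s)  # times f(xj | Cr)
        scores[c] = score
    total = sum(scores.values())  # denominator: sum over all classes
    return {c: v / total for c, v in scores.items()}

# Hypothetical one-feature example with two classes:
priors = {"High": 0.5, "Low": 0.5}
params = {"High": [(10.0, 2.0)], "Low": [(4.0, 2.0)]}
probs = class_probabilities([9.0], priors, params)
```

The observation at x = 9.0 lies much closer to the hypothetical High-class mean, so its normalized probability for High dominates.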
Saved Probability Formulas
This section describes the formulas saved using the Save Probability Formula option. The conditional probability that an observation with predictor values x1, x2, …, xp belongs in the class Cr is denoted P(Cr | x1, …, xp) and is defined in the section “Algorithm”.
Naive Score Formulas
The Naive Score formula for a given class Cr is the numerator P(Cr)P(x1 | Cr)…P(xp | Cr) in the expression for P(Cr | x1, …, xp).
The Naive Score Sum formula sums these scores across all classes. This sum is the denominator in the expression for P(Cr | x1, …, xp).
Naive Prob Formulas
The Naive Prob formula for a given class Cr equals P(Cr | x1, …, xp), which is the Naive Score for Cr divided by the Naive Score Sum.
Naive Predicted Formula
The Naive Predicted Formula for an observation classifies that observation into the class for which P(Cr | x1, …, xp) is the largest. This is equivalent to classifying the observation into the class for which its Naive Score formula is the largest, because every class shares the same Naive Score Sum denominator.
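Because the Naive Score Sum is identical for every class, dividing by it cannot change which class is largest. A quick check of this equivalence (the score values below are arbitrary):

```python
# Arbitrary naive scores for three hypothetical classes.
scores = {"High": 0.12, "Low": 0.03, "Medium": 0.05}

# Normalizing by the shared Naive Score Sum gives the naive probabilities.
total = sum(scores.values())
probs = {c: s / total for c, s in scores.items()}

# The argmax is the same whether taken over scores or probabilities.
by_score = max(scores, key=scores.get)
by_prob = max(probs, key=probs.get)
```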