Example – identifying poisonous mushrooms with rule learners

Each year, many people fall ill, and some even die, from ingesting poisonous wild mushrooms. Because many mushroom species closely resemble one another, even experienced mushroom gatherers are occasionally poisoned.

Unlike the identification of harmful plants such as poison oak or poison ivy, there are no clear rules like "leaves of three, let them be" for identifying whether a wild mushroom is poisonous or edible. Complicating matters, many traditional rules such as "poisonous mushrooms are brightly colored" provide dangerous or misleading information. If simple, clear, and consistent rules were available for identifying poisonous mushrooms, they could save the lives of foragers.

Since one of the strengths of rule-learning algorithms is that they generate easy-to-understand rules, they seem like an appropriate fit for this classification task. However, the rules will only be as useful as they are accurate.

Step 1 – collecting data

To identify rules for distinguishing poisonous mushrooms, we will utilize the Mushroom dataset donated by Jeff Schlimmer of Carnegie Mellon University to the UCI Machine Learning Repository. The raw data is available at http://archive.ics.uci.edu/ml/datasets/Mushroom.

The dataset includes information on 8,124 mushroom samples from 23 species of gilled mushrooms listed in the Audubon Society Field Guide to North American Mushrooms (1981). In the Field Guide, each mushroom species is identified as "definitely edible," "definitely poisonous," or "likely poisonous, and not recommended to be eaten." For the purposes of this dataset, the latter group was combined with the definitely poisonous group to make two classes: poisonous and non-poisonous. The data dictionary available on the UCI website describes the 22 features of the mushroom samples, including characteristics such as cap shape, cap color, odor, gill size and color, stalk shape, and habitat.

Tip

This chapter uses a slightly modified version of the mushroom data. If you plan on following along with the example, download the mushrooms.csv file from the Packt Publishing website and save it to your R working directory.

Step 2 – exploring and preparing the data

We begin by using read.csv() to import the data for our analysis. Since all 22 features and the target class are nominal, we will set stringsAsFactors = TRUE and take advantage of the automatic factor conversion:

> mushrooms <- read.csv("mushrooms.csv", stringsAsFactors = TRUE)

The output of the str(mushrooms) command notes that the data contain 8,124 observations of 23 variables, as the data dictionary described. While most of the str() output is unremarkable, one feature is worth mentioning. Do you notice anything peculiar about the veil_type variable in the following line?

$ veil_type : Factor w/ 1 level "partial": 1 1 1 1 1 1 ...

If you think it is odd that a factor variable has only one level, you are correct. The data dictionary lists two levels for this feature: partial and universal; however, all of the examples in our data are classified as partial. It is likely that this variable was somehow coded incorrectly. In any case, since veil_type does not vary across samples, it does not provide any useful information for prediction. We will drop this variable from our analysis using the following command:

> mushrooms$veil_type <- NULL

By assigning NULL to veil_type, R eliminates the feature from the mushrooms data frame.
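
Had we wanted to confirm that veil_type was the only feature stuck at a single value before dropping it, a quick check like the following would have worked (an optional aside, not part of the original walkthrough):

> # count the distinct values in each column; a constant feature has only one
> sapply(mushrooms, function(x) length(unique(x)))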

Before going much further, we should take a quick look at the distribution of the class variable in our dataset, mushroom type. If the class levels are distributed very unevenly—meaning they are heavily imbalanced—some models, such as rule learners, can have trouble predicting the minority class:

> table(mushrooms$type)
   edible poisonous 
     4208      3916

About 52 percent of the mushroom samples (N = 4,208) are edible, while 48 percent (N = 3,916) are poisonous. Since the class levels are split roughly 50/50, we do not need to worry about imbalanced data.
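
If you prefer to see exact proportions rather than raw counts, prop.table() can be applied to the same table (a quick check, not required for the analysis):

> # class proportions; roughly 0.52 edible and 0.48 poisonous
> prop.table(table(mushrooms$type))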

For the purposes of this experiment, we will consider the 8,124 samples in the mushroom data to be an exhaustive set of all the possible wild mushrooms. This is an important assumption because it means that we do not need to hold some samples out of the training data for testing purposes. We are not trying to develop rules that cover unforeseen types of mushrooms; we are merely trying to find rules that accurately depict the complete set of known mushroom types. Therefore, we can build and test the model on the same data.

Step 3 – training a model on the data

If we trained a hypothetical ZeroR classifier on this data, what would it predict? Since ZeroR ignores all of the features and simply predicts the target's mode, in plain language its rule would state that "all mushrooms are edible." Obviously, this is not a very helpful classifier because it would leave a mushroom gatherer sick or dead for nearly half of the mushroom samples. Our rules will need to do much better than this benchmark in order to provide safe advice that can be published. At the same time, we need simple rules that are easy to remember.
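
To make this benchmark concrete, we can compute the accuracy a ZeroR-style majority-class predictor would achieve on this data (a quick sanity check, not part of the original exercise):

> # accuracy of always predicting the most common class ("edible")
> max(table(mushrooms$type)) / nrow(mushrooms)
[1] 0.5179714

A useful rule learner will therefore need to beat an accuracy of about 52 percent by a wide margin.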

Since simple rules can often be extremely predictive, let's see how a very simple rule learner performs on the mushroom data. Toward this end, we will apply the 1R classifier, which identifies the single feature that is the most predictive of the target class and uses this feature to construct a set of rules.

We will use the 1R implementation in the RWeka package, called OneR(). You may recall that we installed RWeka in Chapter 1, Introducing Machine Learning, as part of the tutorial on installing and loading packages. If you haven't installed the package per those instructions, you will need to use the command install.packages("RWeka") and have Java installed on your system (refer to the installation instructions for more details). With those steps complete, load the package by typing library(RWeka).
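
In other words, if RWeka is not yet set up on your system, the commands are:

> install.packages("RWeka")   # only needed once; requires a working Java installation
> library(RWeka)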

OneR() uses the R formula syntax to specify the model to be trained. The formula syntax uses the ~ operator (known as the tilde) to express the relationship between a target variable and its predictors. The class variable to be learned goes to the left of the tilde, and the predictor features are written on the right, separated by + operators. If you would like to model the relationship between the class y and predictors x1 and x2, you would write the formula as y ~ x1 + x2. If you would like to include all variables in the model, the special term '.' is used. For example, y ~ . specifies the relationship between y and all other features in the dataset.

Tip

The R formula syntax is used across many R functions and offers some powerful features to describe the relationships among predictor variables. We will explore some of these features in later chapters. However, if you're eager for a sneak peek, feel free to read the documentation using the ?formula command.

Using the formula type ~ ., we will allow our first OneR() rule learner to consider all possible features in the mushroom data when constructing its rules to predict type:

> mushroom_1R <- OneR(type ~ ., data = mushrooms)

To examine the rules it created, we can type the name of the classifier object, in this case mushroom_1R:

> mushroom_1R

odor:
  almond  -> edible
  anise  -> edible
  creosote  -> poisonous
  fishy  -> poisonous
  foul  -> poisonous
  musty  -> poisonous
  none  -> edible
  pungent  -> poisonous
  spicy  -> poisonous
(8004/8124 instances correct)

On the first line of the output, we see that the odor feature was selected for rule generation. The categories of odor, such as almond, anise, and so on, specify rules for whether the mushroom is likely to be edible or poisonous. For instance, if the mushroom smells fishy, foul, musty, pungent, spicy, or like creosote, the mushroom is likely to be poisonous. On the other hand, more pleasant smells like almond and anise (or none, that is, no smell at all) indicate edible mushrooms. For the purposes of a field guide for mushroom gathering, these rules could be summarized in a single, simple rule of thumb: "if the mushroom smells unappetizing, then it is likely to be poisonous."

Step 4 – evaluating model performance

The last line of the output notes that the rules correctly classify 8,004 of the 8,124 mushroom samples, or nearly 99 percent. We can obtain additional details about the classifier using the summary() function, as shown in the following example:

> summary(mushroom_1R)

=== Summary ===
Correctly Classified Instances        8004  98.5229 %
Incorrectly Classified Instances       120  1.4771 %
Kappa statistic                          0.9704
Mean absolute error                      0.0148
Root mean squared error                  0.1215
Relative absolute error                  2.958  %
Root relative squared error             24.323  %
Coverage of cases (0.95 level)          98.5229 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances             8124     

=== Confusion Matrix ===
    a    b   <-- classified as
 4208    0 |    a = edible
  120 3796 |    b = poisonous

The section labeled Summary lists a number of different ways to measure the performance of our 1R classifier. We will cover many of these statistics later on in Chapter 10, Evaluating Model Performance, so we will ignore them for now.

The section labeled Confusion Matrix is similar to those used before. Here, we can see where our rules went wrong. The rows in the table indicate the true class of the mushroom, while the columns indicate the predicted values; the key is displayed on the right, with a = edible and b = poisonous. The 120 value in the lower-left corner therefore indicates mushrooms that are actually poisonous but were erroneously classified as edible. On the other hand, there were zero edible mushrooms that were mistakenly classified as poisonous.

Based on this information, our 1R rule is not as safe as it might first appear: it never classifies an edible mushroom as poisonous, but it does classify 120 poisonous mushrooms as edible, which is a dangerous mistake for anyone relying on it in the field. Considering that the learner utilized only a single feature, we did quite well; still, close does not cut it when lives are at stake. Let's see if we can add a few more rules and develop an even better classifier.
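
If you would rather construct the confusion matrix yourself instead of relying on summary(), RWeka classifiers also work with R's generic predict() function. A minimal sketch (the object name mushroom_1R_pred is simply illustrative):

> mushroom_1R_pred <- predict(mushroom_1R, mushrooms)
> table(actual = mushrooms$type, predicted = mushroom_1R_pred)

The resulting table should agree with the Weka confusion matrix above, including the 120 poisonous samples predicted as edible.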

Step 5 – improving model performance

For a more sophisticated rule learner, we will use JRip(), a Java-based implementation of the RIPPER rule learning algorithm. As with the 1R implementation we used previously, JRip() is included in the RWeka package. If you have not done so yet, be sure to load the package using the library(RWeka) command.

The process of training a JRip() model is very similar to how we previously trained a OneR() model. This is one of the pleasant benefits of the functions in the RWeka package; because the syntax is consistent across algorithms, comparing a number of different models is very simple.

Let's train the JRip() rule learner as we had done with OneR(), allowing it to choose rules from all available features:

> mushroom_JRip <- JRip(type ~ ., data = mushrooms)

To examine the rules, type the name of the classifier:

> mushroom_JRip

JRIP rules:
===========
(odor = foul) => type=poisonous (2160.0/0.0)
(gill_size = narrow) and (gill_color = buff) => type=poisonous (1152.0/0.0)
(gill_size = narrow) and (odor = pungent) => type=poisonous (256.0/0.0)
(odor = creosote) => type=poisonous (192.0/0.0)
(spore_print_color = green) => type=poisonous (72.0/0.0)
(stalk_surface_below_ring = scaly) and (stalk_surface_above_ring = silky) => type=poisonous (68.0/0.0)
(habitat = leaves) and (cap_color = white) => type=poisonous (8.0/0.0)
(stalk_color_above_ring = yellow) => type=poisonous (8.0/0.0)
 => type=edible (4208.0/0.0)
Number of Rules : 9

The JRip() classifier learned a total of nine rules from the mushroom data. An easy way to read these rules is to think of them as a list of if-else statements, similar to programming logic; a rough R sketch of this reading appears after the final rule below. The first three rules could be expressed as:

  • If the odor is foul, then the mushroom type is poisonous
  • If the gill size is narrow and the gill color is buff, then the mushroom type is poisonous
  • If the gill size is narrow and the odor is pungent, then the mushroom type is poisonous

Finally, the ninth rule implies that any mushroom sample that was not covered by the preceding eight rules is edible. Following the example of our programming logic, this can be read as:

  • Else, the mushroom is edible
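
Putting this if-else reading together, the first three rules and the final default rule might be sketched in R roughly as follows. This is an illustration only; the classify_mushroom() function is hypothetical and is not how JRip() actually applies its rules:

> classify_mushroom <- function(odor, gill_size, gill_color) {
+   if (odor == "foul") return("poisonous")
+   if (gill_size == "narrow" && gill_color == "buff") return("poisonous")
+   if (gill_size == "narrow" && odor == "pungent") return("poisonous")
+   # ...the remaining poisonous rules would be checked here...
+   return("edible")   # the final rule: anything not covered above is edible
+ }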

The numbers next to each rule indicate the number of instances covered by the rule and a count of misclassified instances. Notably, there were no misclassified mushroom samples using these nine rules. As a result, the number of instances covered by the last rule is exactly equal to the number of edible mushrooms in the data (N = 4,208).
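
We can verify the claim of perfect accuracy directly by counting how many of the predict() results disagree with the true labels (a quick check; the count should be zero):

> # number of mushrooms misclassified by the RIPPER rules
> sum(predict(mushroom_JRip, mushrooms) != mushrooms$type)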

The following figure provides a rough illustration of how the rules are applied to the mushroom data. If you imagine everything within the oval as all species of mushroom, the rule learner identified features, or sets of features, that create homogeneous segments within the larger group. First, the algorithm found a large group of poisonous mushrooms uniquely distinguished by their foul odor. Next, it found smaller and more specific groups of poisonous mushrooms. Once covering rules had been identified for each variety of poisonous mushroom, all of the remaining mushrooms were edible. Thanks to Mother Nature, each variety of mushroom was unique enough that the classifier was able to achieve 100 percent accuracy.
