Example – estimating the quality of wines with regression trees and model trees

Winemaking is a challenging and competitive business that offers the potential for great profit. However, there are numerous factors that contribute to the profitability of a winery. Because wine is an agricultural product, variables as diverse as the weather and the growing environment impact the quality of a varietal. The bottling and manufacturing can also affect the flavor for better or worse. Even the way the product is marketed, from the bottle design to the price point, can affect the customer's perception of taste.

As a consequence, the winemaking industry has invested heavily in data collection and machine learning methods that may assist with the decision science of winemaking. For example, machine learning has been used to discover key differences in the chemical composition of wines from different regions, and to identify the chemical factors that lead a wine to taste sweeter.

More recently, machine learning has been employed to assist with rating the quality of wine—a notoriously difficult task. A review written by a renowned wine critic often determines whether the product ends up on the top or bottom shelf, in spite of the fact that even expert judges are inconsistent when rating a wine in a blinded test.

In this case study, we will use regression trees and model trees to create a system capable of mimicking expert ratings of wine. Because trees result in a model that is readily understood, this could allow winemakers to identify key factors that contribute to better-rated wines. Perhaps more importantly, the system does not suffer from the human elements of tasting, such as the rater's mood or palate fatigue. Computer-aided wine testing may therefore result in a better product as well as more objective, consistent, and fair ratings.

Step 1 – collecting data

To develop the wine rating model, we will use data donated to the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml) by P. Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Their dataset includes examples of red and white Vinho Verde wines from Portugal—one of the world's leading wine-producing countries. Because the factors that contribute to a highly rated wine may differ between the red and white varieties, for this analysis we will examine only the more popular white wines.

Tip

To follow along with this example, download the whitewines.csv file from the Packt Publishing website and save it to your R working directory. The redwines.csv file is also available in case you would like to explore this data on your own.

The white wine data includes information on 11 chemical properties of 4,898 wine samples. For each wine, a laboratory analysis measured characteristics such as the acidity, sugar content, chlorides, sulfur, alcohol, pH, and density. The samples were then rated in a blind tasting by panels of no fewer than three judges on a quality scale ranging from zero (very bad) to 10 (excellent). When the judges disagreed on the rating, the median value was used.

The study by Cortez evaluated the ability of three machine learning approaches to model the wine data: multiple regression, artificial neural networks, and support vector machines. We covered multiple regression earlier in this chapter, and we will learn about neural networks and support vector machines in Chapter 7, Black Box Methods – Neural Networks and Support Vector Machines. The study found that the support vector machine offered significantly better results than the linear regression model. However, unlike regression, the support vector machine model is difficult to interpret. Using regression trees and model trees, we may be able to improve the regression results while still having a model that is easy to understand.

Note

To read more about the wine study described here, please refer to Modeling wine preferences by data mining from physicochemical properties, Cortez, P, Cerdeira, A, Almeida, F, Matos, T, and Reis, J, Decision Support Systems, 2009, Vol. 47, pp. 547-553.

Step 2 – exploring and preparing the data

As usual, we will use the read.csv() function to load the data into R. Since all of the features are numeric, we can safely ignore the stringsAsFactors parameter:

> wine <- read.csv("whitewines.csv")

The wine data includes 11 features and the quality outcome, as follows:

> str(wine)
'data.frame':    4898 obs. of  12 variables:
 $ fixed.acidity       : num  6.7 5.7 5.9 5.3 6.4 7 7.9 ...
 $ volatile.acidity    : num  0.62 0.22 0.19 0.47 0.29 0.12 ...
 $ citric.acid         : num  0.24 0.2 0.26 0.1 0.21 0.41 ...
 $ residual.sugar      : num  1.1 16 7.4 1.3 9.65 0.9 ...
 $ chlorides           : num  0.039 0.044 0.034 0.036 0.041 ...
 $ free.sulfur.dioxide : num  6 41 33 11 36 22 33 17 34 40 ...
 $ total.sulfur.dioxide: num  62 113 123 74 119 95 152 ...
 $ density             : num  0.993 0.999 0.995 0.991 0.993 ...
 $ pH                  : num  3.41 3.22 3.49 3.48 2.99 3.25 ...
 $ sulphates           : num  0.32 0.46 0.42 0.54 0.34 0.43 ...
 $ alcohol             : num  10.4 8.9 10.1 11.2 10.9 ...
 $ quality             : int  5 6 6 4 6 6 6 6 6 7 ...

Compared with other types of machine learning models, one of the advantages of trees is that they can handle many types of data without preprocessing. This means we do not need to normalize or standardize the features.
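The claim that trees do not need normalized features can be checked with a small experiment. The following is a minimal sketch, assuming the rpart package (introduced later in this example) is available; it uses the built-in mtcars dataset purely as a stand-in, since the point does not depend on the wine data. A tree is fitted once on the raw predictors and once on standardized predictors, and the two sets of predictions are compared:

```r
# Sketch: rescaling the predictors should not change a regression tree's
# predictions. mtcars is a built-in dataset used here only for illustration.
library(rpart)

fit_raw <- rpart(mpg ~ ., data = mtcars)

# Standardize every predictor (but not the outcome) to mean 0 and sd 1
predictors <- setdiff(names(mtcars), "mpg")
mtcars_scaled <- mtcars
mtcars_scaled[predictors] <- as.data.frame(scale(mtcars[predictors]))

fit_scaled <- rpart(mpg ~ ., data = mtcars_scaled)

# Each model predicts on the version of the data it was trained on
pred_raw    <- predict(fit_raw, mtcars)
pred_scaled <- predict(fit_scaled, mtcars_scaled)

same_predictions <- isTRUE(all.equal(pred_raw, pred_scaled))
print(same_predictions)
```

Because a split depends only on the ordering of a feature's values, a rescaling that preserves that ordering leaves the partitions, and therefore the predictions, unchanged.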

However, a bit of effort to examine the distribution of the outcome variable is needed to inform our evaluation of the model's performance. For instance, suppose that there was very little variation in quality from wine to wine, or that wines fell into a bimodal distribution: either very good or very bad. This may impact the way we design the model. To check for such extremes, we can examine the distribution of wine quality using a histogram:

> hist(wine$quality)

This produces the following figure:


Figure 6.11: The distribution of the quality ratings of white wines

The wine quality values appear to follow a fairly normal, bell-shaped distribution, centered around a value of six. This makes sense intuitively, because most wines are of average quality; few are particularly bad or good. Although the results are not shown here, it is also useful to examine the summary(wine) output for outliers or other potential data problems. Even though trees are fairly robust to messy data, it is always prudent to check for severe problems. For now, we'll assume that the data is reliable.

Our last step, then, is to divide the dataset into training and testing sets. Since the wine dataset was already sorted randomly, we can partition into two sets of contiguous rows as follows:

> wine_train <- wine[1:3750, ]
> wine_test <- wine[3751:4898, ]

To mirror the conditions used by Cortez, we used sets of roughly 75 percent and 25 percent for training and testing, respectively (3,750 and 1,148 rows). We'll evaluate the performance of our tree-based models on the testing data to see if we can obtain results comparable to the prior research study.
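Splitting by contiguous rows is only safe because this dataset was already in random order. For data that is not pre-shuffled, a random split is the usual alternative. The following is a minimal sketch on a hypothetical toy data frame (the names toy, train_idx, toy_train, and toy_test are illustrative and not part of the wine example):

```r
# Hypothetical stand-in for a data frame whose rows are sorted by outcome,
# so a contiguous split would give unrepresentative training and test sets
set.seed(123)  # arbitrary seed, used only to make the split reproducible
toy <- data.frame(id = 1:100, quality = rep(3:7, each = 20))

# Draw 75 percent of the row indices at random for training
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))

toy_train <- toy[train_idx, ]   # randomly selected rows
toy_test  <- toy[-train_idx, ]  # all remaining rows

print(nrow(toy_train))
print(nrow(toy_test))
```

The negative index in toy[-train_idx, ] guarantees that no row appears in both sets.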

Step 3 – training a model on the data

We will begin by training a regression tree model. Although almost any implementation of decision trees can be used to perform regression tree modeling, the rpart (recursive partitioning) package offers the most faithful implementation of regression trees as they were described by the CART team. As the classic R implementation of CART, the rpart package is also well-documented and supported with functions for visualizing and evaluating the rpart models.

Install the rpart package using the install.packages("rpart") command. It can then be loaded into your R session using the library(rpart) statement. The following syntax will train a tree using the default settings, which typically work fairly well. If you need more finely-tuned settings, refer to the documentation for the control parameters using the ?rpart.control command.


Using the R formula interface, we can specify quality as the outcome variable and use the dot notation to allow all other columns in the wine_train data frame to be used as predictors. The resulting regression tree model object is named m.rpart to distinguish it from the model tree we will train later:

> m.rpart <- rpart(quality ~ ., data = wine_train)
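To see what the dot notation expands to, the formula can be inspected with base R's terms() function. This is a small sketch on a hypothetical three-column data frame (toy is an illustrative name), not on the wine data itself:

```r
# Hypothetical miniature data frame with two predictors and an outcome
toy <- data.frame(alcohol = c(9.5, 11.2, 12.8),
                  pH      = c(3.1, 3.3, 3.2),
                  quality = c(5L, 6L, 7L))

# Expanding the dot against the data frame lists every column
# except the outcome named on the left-hand side
expanded <- terms(quality ~ ., data = toy)
predictor_names <- attr(expanded, "term.labels")
print(predictor_names)
```

In the wine example, the same expansion would yield the 11 chemical properties as predictors.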

For basic information about the tree, simply type the name of the model object:

> m.rpart
n= 3750

node), split, n, deviance, yval
      * denotes terminal node

 1) root 3750 2945.53200 5.870933  
   2) alcohol< 10.85 2372 1418.86100 5.604975  
     4) volatile.acidity>=0.2275 1611  821.30730 5.432030  
       8) volatile.acidity>=0.3025 688  278.97670 5.255814 *
       9) volatile.acidity< 0.3025 923  505.04230 5.563380 *
     5) volatile.acidity< 0.2275 761  447.36400 5.971091 *
   3) alcohol>=10.85 1378 1070.08200 6.328737  
     6) free.sulfur.dioxide< 10.5 84   95.55952 5.369048 *
     7) free.sulfur.dioxide>=10.5 1294  892.13600 6.391036  
      14) alcohol< 11.76667 629  430.11130 6.173291  
        28) volatile.acidity>=0.465 11   10.72727 4.545455 *
        29) volatile.acidity< 0.465 618  389.71680 6.202265 *
      15) alcohol>=11.76667 665  403.99400 6.596992 *

For each node in the tree, the number of examples reaching the decision point is listed. For instance, all 3,750 examples begin at the root node, of which 2,372 have alcohol < 10.85 and 1,378 have alcohol >= 10.85. Because alcohol was used first in the tree, it is the single most important predictor of wine quality.

Nodes indicated by * are terminal or leaf nodes, which means that they result in a prediction (listed here as yval). For example, node 5 has a yval of 5.971091. When the tree is used for predictions, any wine samples with alcohol < 10.85 and volatile.acidity < 0.2275 would therefore be predicted to have a quality value of 5.97.
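The printed tree can be read as a set of nested if/else decisions. As an illustration only, the following sketch transcribes the splits and yval numbers from the output above into a hand-written function (predict_quality_by_hand is a hypothetical name; in practice predict() does this work):

```r
# Hand transcription of the printed rpart tree; thresholds and leaf values
# are copied from the output above
predict_quality_by_hand <- function(alcohol, volatile.acidity,
                                    free.sulfur.dioxide) {
  if (alcohol < 10.85) {
    if (volatile.acidity >= 0.2275) {
      if (volatile.acidity >= 0.3025) 5.255814 else 5.563380  # nodes 8, 9
    } else {
      5.971091                                                # node 5
    }
  } else {
    if (free.sulfur.dioxide < 10.5) {
      5.369048                                                # node 6
    } else if (alcohol < 11.76667) {
      if (volatile.acidity >= 0.465) 4.545455 else 6.202265   # nodes 28, 29
    } else {
      6.596992                                                # node 15
    }
  }
}

# A made-up wine with low alcohol and low volatile acidity lands in node 5
node5_prediction <- predict_quality_by_hand(alcohol = 10.1,
                                            volatile.acidity = 0.19,
                                            free.sulfur.dioxide = 33)
print(node5_prediction)
```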

A more detailed summary of the tree's fit, including the mean squared error for each of the nodes and an overall measure of feature importance, can be obtained using the command summary(m.rpart).
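The deviance column in the printed tree already contains enough information to reproduce some of these figures by hand. For a regression tree, the deviance of a node is its sum of squared differences from the node mean, so dividing by n gives the node's mean squared error. The sketch below uses only numbers copied from the output above; the variable names are illustrative:

```r
# Values copied from the printed tree (nodes 1, 2, and 3)
root_dev  <- 2945.532
root_n    <- 3750
left_dev  <- 1418.861   # alcohol <  10.85
right_dev <- 1070.082   # alcohol >= 10.85

# Mean squared error at the root: deviance divided by number of examples
root_mse <- root_dev / root_n

# How much of the root's squared error the first split removes
reduction       <- root_dev - (left_dev + right_dev)
reduction_share <- reduction / root_dev

print(round(root_mse, 4))
print(round(reduction_share, 3))
```

By this calculation, the first split on alcohol accounts for roughly 15 percent of the total squared error at the root.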

Visualizing decision trees

Although the tree can be understood using only the preceding output, it is often more readily understood using visualization. The rpart.plot package by Stephen Milborrow provides an easy-to-use function that produces publication-quality decision trees.

Note

For more information on rpart.plot, including additional examples of the types of decision tree diagrams the function can produce, refer to the author's website at http://www.milbo.org/rpart-plot/.

After installing the package using the install.packages("rpart.plot") command, the rpart.plot() function produces a tree diagram from any rpart model object. The following commands plot the regression tree we built earlier:

> library(rpart.plot)
> rpart.plot(m.rpart, digits = 3)

The resulting tree diagram is as follows:


Figure 6.12: A visualization of the wine quality regression tree model

In addition to the digits parameter that controls the number of numeric digits to include in the diagram, many other aspects of the visualization can be adjusted. The following command shows just a few of the useful options:

> rpart.plot(m.rpart, digits = 4, fallen.leaves = TRUE,
               type = 3, extra = 101)

The fallen.leaves parameter forces the leaf nodes to be aligned at the bottom of the plot, while the type and extra parameters affect the way the decisions and nodes are labeled. The numbers 3 and 101 refer to specific style formats, which can be found in the command's documentation, or via experimentation with various numbers.

The result of these changes is a very different looking tree diagram:


Figure 6.13: Changing the plot function parameters allows customization of the tree visualization

Visualizations like these may assist with the dissemination of regression tree results, as they are readily understood even without a mathematics background. In both cases, the numbers shown in the leaf nodes are the predicted values for the examples reaching that node. Showing the diagram to the wine producers may thus help to identify the key factors that predict the higher-rated wines.

Step 4 – evaluating model performance

To use the regression tree model to make predictions on the test data, we use the predict() function. By default, this returns the estimated numeric value for the outcome variable, which we'll save in a vector named p.rpart:

> p.rpart <- predict(m.rpart, wine_test)

A quick look at the summary statistics of our predictions suggests a potential problem: the predictions fall within a much narrower range than the true values:

> summary(p.rpart)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.545   5.563   5.971   5.893   6.202   6.597
> summary(wine_test$quality)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.000   5.000   6.000   5.901   6.000   9.000

This finding suggests that the model is not correctly identifying the extreme cases, in particular, the best and worst wines. On the other hand, between the first and third quartile, we may be doing well.
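The narrow range follows directly from how a regression tree predicts: every wine receives the value of whichever leaf it reaches, so the predictions can only take as many distinct values as there are leaves. A short sketch using the seven leaf values copied from the tree printed earlier:

```r
# yval of each terminal node in the printed tree (nodes 8, 9, 5, 6, 28, 29, 15)
leaf_values <- c(5.255814, 5.563380, 5.971091, 5.369048,
                 4.545455, 6.202265, 6.596992)

n_leaves    <- length(leaf_values)
leaf_limits <- range(leaf_values)

print(n_leaves)
print(leaf_limits)
```

The smallest and largest leaf values match the Min. and Max. of summary(p.rpart), which is why no wine can be predicted as a 3 or a 9 by this tree.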

The correlation between the predicted and actual quality values provides a simple way to gauge the model's performance. Recall that the cor() function can be used to measure the relationship between two equal-length vectors. We'll use this to compare how well the predicted values correspond to the true values:

> cor(p.rpart, wine_test$quality)
[1] 0.5369525

A correlation of 0.54 indicates a moderate positive relationship between the predicted and actual ratings. However, the correlation only measures how strongly the predictions are related to the true value; it is not a measure of how far off the predictions were from the true values.
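A contrived example makes the distinction concrete. In the sketch below, the made-up predictions are always exactly two points too high, so they track the true values perfectly in the correlation sense while still being wrong for every wine:

```r
# Made-up ratings and deliberately biased predictions
actual    <- c(3, 5, 6, 6, 7, 9)
predicted <- actual + 2   # always two points too high

correlation   <- cor(predicted, actual)
average_error <- mean(abs(actual - predicted))

print(correlation)
print(average_error)
```

A high correlation here coexists with an average error of two full points, which is why a second measure of performance is needed.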

Measuring performance with the mean absolute error

Another way to think about the model's performance is to consider how far, on average, its prediction was from the true value. This measurement is called the mean absolute error (MAE). The equation for MAE is as follows, where n indicates the number of predictions and e_i indicates the error for prediction i:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| e_i \right|$$

As the name implies, this equation takes the mean of the absolute value of the errors. Since the error is just the difference between the predicted and actual values, we can create a simple MAE() function as follows:

> MAE <- function(actual, predicted) {
    mean(abs(actual - predicted))
}

The MAE for our predictions is then:

> MAE(wine_test$quality, p.rpart)
[1] 0.5872652

This implies that, on average, the difference between our model's predictions and the true quality score was about 0.59. On a quality scale from zero to 10, this seems to suggest that our model is doing fairly well.
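To make the arithmetic behind such a figure concrete, here is the same calculation on four made-up wines, broken into its steps. The MAE() definition is repeated so that the snippet runs on its own:

```r
MAE <- function(actual, predicted) {
  mean(abs(actual - predicted))
}

# Four made-up wines
actual    <- c(5, 6, 7, 4)
predicted <- c(5.5, 6.0, 6.0, 5.0)

errors     <- actual - predicted   # -0.5  0.0  1.0 -1.0
abs_errors <- abs(errors)          #  0.5  0.0  1.0  1.0
toy_mae    <- MAE(actual, predicted)

print(toy_mae)   # (0.5 + 0 + 1 + 1) / 4 = 0.625
```

Because the absolute value discards the sign, over-predictions and under-predictions do not cancel each other out.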

On the other hand, recall that most wines were neither very good nor very bad; the typical quality score was around five to six. Therefore, a model that did nothing but predict the mean value may still do fairly well according to this metric.

The mean quality rating in the training data is as follows:

> mean(wine_train$quality)
[1] 5.870933

If we predicted the value 5.87 for every wine sample, we would have a mean absolute error of only about 0.67:

> MAE(wine_test$quality, 5.87)
[1] 0.6722474

Our regression tree (MAE = 0.59) comes closer on average to the true quality score than the imputed mean (MAE = 0.67), but not by much. In comparison, Cortez reported an MAE of 0.58 for the neural network model and an MAE of 0.45 for the support vector machine. This suggests that there is room for improvement.
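One way to quantify "not by much" is to express the improvement relative to the baseline. The sketch below uses only the two MAE values reported above; the variable names are illustrative:

```r
# MAE values copied from the output above
mae_baseline <- 0.6722474   # predicting 5.87 for every wine
mae_tree     <- 0.5872652   # regression tree

# Fraction of the baseline error that the tree removes
relative_reduction <- (mae_baseline - mae_tree) / mae_baseline
percent_reduction  <- round(100 * relative_reduction, 1)

print(percent_reduction)
```

By this measure, the regression tree removes only about 13 percent of the error made by the mean-only baseline.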

Step 5 – improving model performance

To improve the performance of our learner, let's apply a model tree algorithm, which is a more complex application of trees to numeric prediction. Recall that a model tree extends regression trees by replacing the leaf nodes with regression models. This often results in more accurate results than regression trees, which use only a single numeric value for the prediction at the leaf nodes.
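The difference between the two kinds of leaves can be sketched in base R on simulated data. This is not the M5 or Cubist algorithm: the single split at x = 5 is chosen by hand, all names are illustrative, and the errors are measured on the training data itself. It only contrasts a constant prediction per leaf with a linear model per leaf:

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulated data with a different linear trend on each side of x = 5
x <- seq(0, 10, length.out = 200)
y <- ifelse(x < 5, 1 + 0.8 * x, 9 - 0.5 * x) + rnorm(200, sd = 0.2)
toy  <- data.frame(x = x, y = y)
left <- toy$x < 5   # the hand-picked split

# Regression-tree style: each leaf predicts its own mean
pred_const <- ifelse(left, mean(toy$y[left]), mean(toy$y[!left]))

# Model-tree style: each leaf gets its own linear regression
fit_left  <- lm(y ~ x, data = toy[left, ])
fit_right <- lm(y ~ x, data = toy[!left, ])
pred_linear <- numeric(nrow(toy))
pred_linear[left]  <- predict(fit_left,  toy[left, ])
pred_linear[!left] <- predict(fit_right, toy[!left, ])

mae_const  <- mean(abs(toy$y - pred_const))
mae_linear <- mean(abs(toy$y - pred_linear))

print(c(constant_leaves = mae_const, linear_leaves = mae_linear))
```

When the outcome still varies systematically within a leaf, as it does here, the per-leaf regression captures a trend that a single mean cannot.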

The current state-of-the-art in model trees is the Cubist algorithm, which itself is an enhancement of the M5 model tree algorithm—both of which were published by J.R. Quinlan in the early 1990s. Though the implementation details are beyond the scope of this book, the Cubist algorithm involves building a decision tree, creating decision rules based on the branches of the tree, and building a regression model at each of the leaf nodes. Additional heuristics, such as pruning and boosting, are used to improve the quality of the predictions and smoothness across the range of predicted values.

Note

For more background on the Cubist and M5 algorithms, see Learning With Continuous Classes, Quinlan, JR, Proceedings of the 5th Australian Joint Conference on Artificial Intelligence, 1992; pp. 343-348. Additionally, see Combining Instance-Based and Model-Based Learning, Quinlan, JR, Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 236-243.

The Cubist algorithm is available in R via the Cubist package and the associated cubist() function. The syntax of this function is shown in the following table:


We'll fit the Cubist model tree using a slightly different syntax from what was used for the regression tree, as the cubist() function does not accept the R formula syntax. Instead, we must specify the data frame columns used for the x independent variables and the y dependent variable. With the wine quality to be predicted residing in column 12, and using all other columns as predictors, the full command is as follows:

> library(Cubist)
> m.cubist <- cubist(x = wine_train[-12], y = wine_train$quality)
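The expression wine_train[-12] relies on negative indexing to drop a column by position. As a brief sketch on a hypothetical three-column data frame (toy is an illustrative name), dropping the outcome by position and selecting the predictors by name produce the same result:

```r
# Hypothetical miniature data frame; the outcome is in column 3
toy <- data.frame(alcohol = c(9.5, 11.2),
                  pH      = c(3.1, 3.3),
                  quality = c(5L, 6L))

x_by_position <- toy[-3]                                # drop column 3
x_by_name     <- toy[setdiff(names(toy), "quality")]    # drop "quality"

print(names(x_by_position))
print(isTRUE(all.equal(x_by_position, x_by_name)))
```

Selecting by name is slightly more verbose but keeps working if the column order of the data frame ever changes.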

Basic information about the model tree can be examined by typing its name:

> m.cubist

Call:
cubist.default(x = wine_train[-12], y = wine_train$quality)

Number of samples: 3750
Number of predictors: 11

Number of committees: 1
Number of rules: 25

In this output, we see that the algorithm generated 25 rules to model the wine quality. To examine some of these rules, we can apply the summary() function to the model object. Since the complete tree is very large, only the first few lines of output depicting the first decision rule are included here:

> summary(m.cubist)

  Rule 1: [21 cases, mean 5.0, range 4 to 6, est err 0.5]

   if
        free.sulfur.dioxide > 30
        total.sulfur.dioxide > 195
        total.sulfur.dioxide <= 235
        sulphates > 0.64
        alcohol > 9.1
   then
        outcome = 573.6 + 0.0478 total.sulfur.dioxide
                  - 573 density - 0.788 alcohol
                  + 0.186 residual.sugar - 4.73 volatile.acidity

You will note that the if portion of the output is somewhat similar to the regression tree we built earlier. A series of decisions based on the wine properties of sulfur dioxide, sulphates, and alcohol creates a rule culminating in the final prediction. A key difference between this model tree output and the earlier regression tree output, however, is that the nodes here terminate not in a numeric prediction, but rather a linear model.

The linear model for this rule is shown in the then output following the outcome = statement. The numbers can be interpreted exactly the same as the multiple regression models we built earlier in this chapter. Each value is the estimated beta of the associated feature, that is, the net effect of that feature on the predicted wine quality. For example, the coefficient of 0.186 for residual sugar implies that for an increase of 1 unit of residual sugar, the wine quality rating is expected to increase by 0.186.

It is important to note that the regression effects estimated by this model apply only to wine samples reaching this node; an examination of the entirety of the Cubist output reveals that a total of 25 linear models were built in this model tree, one for each decision rule, and each with different parameter estimates of the impact of residual sugar and the 10 other features.
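Rule 1 can be written out by hand to show how the condition and the linear model fit together. The sketch below transcribes the printed rule into two hypothetical helper functions (rule1_applies and rule1_outcome are illustrative names, not part of the Cubist package) and applies them to a made-up wine. Because the printed coefficients are rounded, values computed this way are only approximate:

```r
# Conditions from the "if" part of Rule 1
rule1_applies <- function(w) {
  w$free.sulfur.dioxide > 30 &
    w$total.sulfur.dioxide > 195 &
    w$total.sulfur.dioxide <= 235 &
    w$sulphates > 0.64 &
    w$alcohol > 9.1
}

# Linear model from the "then" part of Rule 1 (rounded coefficients)
rule1_outcome <- function(w) {
  573.6 + 0.0478 * w$total.sulfur.dioxide -
    573 * w$density - 0.788 * w$alcohol +
    0.186 * w$residual.sugar - 4.73 * w$volatile.acidity
}

# A made-up wine that satisfies the rule, and a copy with one more
# unit of residual sugar
wine_a <- data.frame(free.sulfur.dioxide = 45, total.sulfur.dioxide = 210,
                     sulphates = 0.70, alcohol = 9.5, density = 0.998,
                     residual.sugar = 10, volatile.acidity = 0.30)
wine_b <- wine_a
wine_b$residual.sugar <- 11

sugar_effect <- rule1_outcome(wine_b) - rule1_outcome(wine_a)

print(rule1_applies(wine_a))
print(sugar_effect)
```

The difference between the two outcomes equals the residual sugar coefficient, which is exactly the interpretation described above, and it holds only for wines covered by this rule.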

To examine the performance of this model, we'll look at how well it performs on the unseen test data. The predict() function gets us a vector of predicted values:

> p.cubist <- predict(m.cubist, wine_test)

The model tree appears to be predicting a wider range of values than the regression tree:

> summary(p.cubist)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.677   5.416   5.906   5.848   6.238   7.393

The correlation also seems to be substantially higher:

> cor(p.cubist, wine_test$quality)
[1] 0.6201015

Furthermore, the model slightly reduced the mean absolute error:

> MAE(wine_test$quality, p.cubist)
[1] 0.5339725

Although we did not improve a great deal beyond the regression tree, we surpassed the performance of the neural network model published by Cortez, and we are getting closer to the published mean absolute error value of 0.45 for the support vector machine model, all while using a much simpler learning method.

Tip

Not surprisingly, we have confirmed that predicting the quality of wines is a difficult problem; wine tasting, after all, is inherently subjective. If you would like additional practice, you may try revisiting this problem after reading Chapter 11, Improving Model Performance, which covers additional techniques that may lead to better results.
