We'll put logistic regression for the binary classification task to the test with a real-world data set from the UCI Machine Learning Repository. This time, we will be working with the Statlog (Heart) data set, which we will refer to as the heart data set henceforth for brevity. The data set can be downloaded from the UCI Machine Learning Repository's website at http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29. The data contain 270 observations for patients with potential heart problems. Of these, 120 patients were shown to have heart problems, so the split between the two classes is fairly even. The task is to predict whether a patient has heart disease based on their profile and a series of medical tests. First, we'll load the data into a data frame and rename the columns according to the website:
> heart <- read.table("heart.dat", quote = "\"")
> names(heart) <- c("AGE", "SEX", "CHESTPAIN", "RESTBP", "CHOL",
+                   "SUGAR", "ECG", "MAXHR", "ANGINA", "DEP",
+                   "EXERCISE", "FLUOR", "THAL", "OUTPUT")
The following table contains the definitions of our input features and the output:
| Column name | Type | Definition |
|---|---|---|
| AGE | Numerical | Age (years) |
| SEX | Binary | Gender |
| CHESTPAIN | Categorical | 4-valued chest pain type |
| RESTBP | Numerical | Resting blood pressure (mm Hg) |
| CHOL | Numerical | Serum cholesterol (mg/dl) |
| SUGAR | Binary | Is the fasting blood sugar level > 120 mg/dl? |
| ECG | Categorical | 3-valued resting electrocardiographic results |
| MAXHR | Numerical | Maximum heart rate achieved (beats per minute) |
| ANGINA | Binary | Was angina induced by exercise? |
| DEP | Numerical | ST depression induced by exercise relative to rest |
| EXERCISE | Ordered categorical | Slope of the peak exercise ST segment |
| FLUOR | Numerical | The number of major vessels colored by fluoroscopy |
| THAL | Categorical | 3-valued Thal |
| OUTPUT | Binary | Presence or absence of heart disease |
Before we train a logistic regression model on these data, there are a couple of preprocessing steps that we should perform. A common pitfall when working with numerical data is failing to notice that a feature is actually categorical rather than numerical because its levels happen to be coded as numbers. In the heart data set, we have four such features. The CHESTPAIN, THAL, and ECG features are all categorical features. The EXERCISE variable, although an ordered categorical variable, is nonetheless a categorical variable, so it will have to be coded as a factor as well:
> heart$CHESTPAIN = factor(heart$CHESTPAIN)
> heart$ECG = factor(heart$ECG)
> heart$THAL = factor(heart$THAL)
> heart$EXERCISE = factor(heart$EXERCISE)
In Chapter 1, Gearing Up for Predictive Modeling, we saw how we can transform categorical features with many levels into a series of binary-valued indicator variables. By doing this, we can use them in a model such as linear or logistic regression, which requires all the inputs to be numerical. As long as the relevant categorical variables in a data frame have been coded as factors, R will automatically apply a coding scheme when performing logistic regression. Concretely, R will treat one of the k factor levels as a reference level and create k-1 binary features from the remaining factor levels. We'll see visual evidence of this when we study the summary output of the logistic regression model that we'll train.
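To see this coding scheme in action, here is a small sketch using a synthetic four-level factor standing in for a variable like CHESTPAIN; contrasts() and model.matrix() are the base R functions that expose the coding glm() applies internally:

```r
# A synthetic 4-level factor standing in for a variable like CHESTPAIN
pain <- factor(c(1, 2, 4, 3, 2), levels = 1:4)

# contrasts() shows the coding scheme: level 1 becomes the reference
# level, and levels 2-4 each get their own binary indicator column
contrasts(pain)

# model.matrix() applies this coding to produce the numerical design
# matrix that glm() works with (an intercept plus 3 indicator columns)
model.matrix(~ pain)
```

This is why the summary output we'll see later contains coefficients named CHESTPAIN2, CHESTPAIN3, and CHESTPAIN4, but no CHESTPAIN1.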
Next, we should observe that the OUTPUT variable is coded so that class 1 corresponds to the absence of heart disease and class 2 corresponds to its presence. As a final change, we'll want to recode the OUTPUT variable so that we have the familiar class labels of 0 and 1, respectively. This is done by simply subtracting 1:
> heart$OUTPUT = heart$OUTPUT - 1
Our data frame is now ready. Before we train our model, however, we will split our data frame into two parts, for training and testing, exactly as we did for linear regression. Once again, we'll use an 85-15 split:
> library(caret)
> set.seed(987954)
> heart_sampling_vector <-
+     createDataPartition(heart$OUTPUT, p = 0.85, list = FALSE)
> heart_train <- heart[heart_sampling_vector,]
> heart_train_labels <- heart$OUTPUT[heart_sampling_vector]
> heart_test <- heart[-heart_sampling_vector,]
> heart_test_labels <- heart$OUTPUT[-heart_sampling_vector]
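For a binary outcome, createDataPartition() produces a stratified split, preserving the class proportions in both parts. As a rough sketch of what this stratification amounts to, here is a base R version run on synthetic labels with the heart data's class balance (caret handles all of this for us; the code below only illustrates the idea):

```r
# Synthetic labels with the heart data's class balance:
# 150 without heart disease (0) and 120 with heart disease (1)
labels <- rep(c(0, 1), times = c(150, 120))

set.seed(987954)
# Sample 85% of the indices within each class separately, so both
# classes appear in the training set in their original proportions
train_indices <- unlist(lapply(split(seq_along(labels), labels),
    function(idx) sample(idx, round(0.85 * length(idx)))))

# The class proportion in the training set closely matches the
# proportion in the full set (120/270)
mean(labels[train_indices])
```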
We now have 230 observations in our training set and 40 observations in our test set. To train a logistic regression model in R, we use the glm() function, which stands for generalized linear model. This function can be used to train various generalized linear models, but we'll focus on the syntax and usage for logistic regression here. The call is as follows:
> heart_model <- glm(OUTPUT ~ ., data = heart_train, family = binomial("logit"))
Note that the format is very similar to what we saw with linear regression. The first parameter is the model formula, which identifies the output variable and which features we want to use (in this case, all of them). The second parameter is the data frame, and the final family parameter is used to specify that we want to perform logistic regression. We can use the summary() function to find out more about the model we just trained, as follows:
> summary(heart_model)

Call:
glm(formula = OUTPUT ~ ., family = binomial("logit"), data = heart_train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7137  -0.4421  -0.1382   0.3588   2.8118

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.946051   3.477686  -2.285 0.022321 *
AGE         -0.020538   0.029580  -0.694 0.487482
SEX          1.641327   0.656291   2.501 0.012387 *
CHESTPAIN2   1.308530   1.000913   1.307 0.191098
CHESTPAIN3   0.560233   0.865114   0.648 0.517255
CHESTPAIN4   2.356442   0.820521   2.872 0.004080 **
RESTBP       0.026588   0.013357   1.991 0.046529 *
CHOL         0.008105   0.004790   1.692 0.090593 .
SUGAR       -1.263606   0.732414  -1.725 0.084480 .
ECG1         1.352751   3.287293   0.412 0.680699
ECG2         0.563430   0.461872   1.220 0.222509
MAXHR       -0.013585   0.012873  -1.055 0.291283
ANGINA       0.999906   0.525996   1.901 0.057305 .
DEP          0.196349   0.282891   0.694 0.487632
EXERCISE2    0.743530   0.560700   1.326 0.184815
EXERCISE3    0.946718   1.165567   0.812 0.416655
FLUOR        1.310240   0.308348   4.249 2.15e-05 ***
THAL6        0.304117   0.995464   0.306 0.759983
THAL7        1.717886   0.510986   3.362 0.000774 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 315.90  on 229  degrees of freedom
Residual deviance: 140.36  on 211  degrees of freedom
AIC: 178.36

Number of Fisher Scoring iterations: 6
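One way to read these coefficients is on the odds scale: each estimate is an additive change in the log-odds of heart disease, so exponentiating it gives a multiplicative odds ratio. As a small sketch, using the FLUOR estimate copied from the output above:

```r
# The FLUOR coefficient from the summary output above (log-odds scale)
fluor_coef <- 1.310240

# Exponentiating a logistic regression coefficient yields an odds
# ratio: each additional major vessel colored by fluoroscopy
# multiplies the estimated odds of heart disease by roughly 3.7
exp(fluor_coef)
```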