Chapter 15. Distributions and Modeling

Using summary statistics and plots to understand data is great, but they have their limitations. Statistics don’t give you the shape of the data, and plots aren’t scalable to many variables (with more than five or six, things start to get confusing),[58] nor are they scalable in number (since you have to physically look at each one). Neither statistics nor plots are very good at giving you predictions from the data.

This is where models come in: if you understand enough about the structure of the data to be able to run a suitable model, then you can pass quantitative judgments about the data and make predictions.

There are lots of different statistical models, with more being invented as fast as university statistics departments can think of them. In order to avoid turning this into a statistics course, this chapter is just going to deal with some really simple regression models. If you want to learn some stats, I recommend The R Book or Discovering Statistics Using R, both of which explain statistical concepts in glorious slow motion.

Before we get to running any models, we need a bit of background on generating random numbers, different kinds of distributions, and formulae.

Chapter Goals

After reading this chapter, you should:

  • Be able to generate random numbers from many distributions
  • Be able to find quantiles and inverse quantiles from those distributions
  • Know how to write a model formula
  • Understand how to run, update, and plot a linear regression

Random Numbers

Random numbers are critical to many analyses, and consequently R has a wide variety of functions for sampling from different distributions.

The sample Function

We’ve seen the sample function a few times already (it was first introduced in Chapter 3). It’s an important workhorse function for generating random values, and its behavior has a few quirks, so it’s worth getting to know it more thoroughly. If you just pass it a number, n, it will return a permutation of the natural numbers from 1 to n:

sample(7)
## [1] 1 2 5 7 4 6 3

If you give it a second value, it will return that many random numbers between 1 and n:

sample(7, 5)
## [1] 7 2 3 1 5

Notice that all those random numbers are different. By default, sample samples without replacement. That is, each value can only appear once. To allow sampling with replacement, pass replace = TRUE:

sample(7, 10, replace = TRUE)
##  [1] 4 6 1 7 5 3 6 7 4 2

Of course, returning natural numbers isn’t that interesting most of the time, but sample is flexible enough to let us sample from any vector that we like. Most commonly, we might want to pass it a character vector:

sample(colors(), 5) #a great way to pick the color scheme for your house
## [1] "grey53"       "deepskyblue2" "gray94"       "maroon2"
## [5] "gray18"

If we were feeling more adventurous, we could pass it some dates:

sample(.leap.seconds, 4)
## [1] "2012-07-01 01:00:00 BST" "1994-07-01 01:00:00 BST"
## [3] "1981-07-01 01:00:00 BST" "1990-01-01 00:00:00 GMT"

We can also weight the probability of each input value being returned by passing a prob argument. In the next example, we use R to randomly decide which month to go on holiday, then fudge it to increase our chances of a summer break:

weights <- c(1, 1, 2, 3, 5, 8, 13, 21, 8, 3, 1, 1)
sample(month.abb, 1, prob = weights)
## [1] "Jul"

Sampling from Distributions

Oftentimes, we want to generate random numbers from a probability distribution. R has functions available for sampling from almost one hundred distributions, mixtures, and copulas across the various packages. The ?Distributions help page documents the facilities available in base R, and the CRAN Task View on distributions gives extensive directions about which package to look in if you want something more esoteric.

Most of the random number generation functions have the name r<distn>. For example, we’ve already seen runif, which generates uniform random numbers, and rnorm, which generates normally distributed random numbers. The first parameter for each of these functions is the number of random numbers to generate, and further parameters affect the shape of the distribution. For example, runif allows you to set the lower and upper bounds of the distribution:

runif(5)          #5 uniform random numbers between 0 and 1
runif(5, 1, 10)   #5 uniform random numbers between 1 and 10
rnorm(5)          #5 normal random numbers with mean 0 and std dev 1
rnorm(5, 3, 7)    #5 normal random numbers with mean 3 and std dev 7

Random numbers generated by R, as with any other software, are actually pseudorandom. That is, they are generated by an algorithm rather than by a genuinely random process. R supports several algorithms out of the box, as described on the ?RNG help page, and more specialist algorithms (for parallel number generation, for example) are available in other packages. The CRAN Task View on distributions mentioned above also describes where to find more algorithms. You can see which algorithms are used for uniform and normal random number generation with the RNGkind function (sampling from other distributions typically uses some function of uniform or normal random numbers):

RNGkind()
## [1] "Mersenne-Twister" "Inversion"

Random number generators require a starting point to generate numbers from, known as a “seed.” By setting the seed to a particular value, you can guarantee that the same random numbers will be generated each time you run the same piece of code. For example, this book fixes the seed so that the examples are the same each time the book is built. You can set the seed with the set.seed function. It takes an integer as an input. Which integer you choose doesn’t matter; different seeds just give different random numbers:

set.seed(1)
runif(5)
## [1] 0.2655 0.3721 0.5729 0.9082 0.2017
set.seed(1)
runif(5)
## [1] 0.2655 0.3721 0.5729 0.9082 0.2017
set.seed(1)
runif(5)
## [1] 0.2655 0.3721 0.5729 0.9082 0.2017

You can also specify different generation algorithms, though this is very advanced usage, so don’t do that unless you know what you are doing.

Distributions

As well as a function for generating random numbers, most distributions also have functions for calculating their probability density function (PDF), cumulative distribution function (CDF), and inverse CDF.

Like the RNG functions with names beginning with r, these functions have names beginning with d, p, and q, respectively. For example, the normal distribution has functions dnorm, pnorm, and qnorm. These functions are perhaps best demonstrated visually, and are shown in Figure 15-1. (The code is omitted, on account of being rather fiddly.)

Figure 15-1. PDF, CDF, and inverse CDF of a normal distribution
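
Figure aside, you can get a feel for how the three functions relate with a few quick calls (a sketch, not from the book; the values in the comments are approximate):

dnorm(0)       #density of the standard normal at its mean, about 0.399
pnorm(1.96)    #probability of a value falling below 1.96, about 0.975
qnorm(0.975)   #the inverse CDF: the value with 97.5% below it, about 1.96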

Formulae

We’ve already seen the use of formulae in lattice plots and ggplot2 facets. Many statistical models use formulae in a similar way, in order to specify the structure of the variables in the model. The exact meaning of the formula depends upon the model function that it gets passed to (in the same way that a formula in a lattice plot means something different to a formula in a ggplot2 facet), but there is a common pattern that most models satisfy: the lefthand side specifies the response variable, the righthand side specifies independent variables, and the two are separated by a tilde. Returning to the gonorrhoea dataset as an example, we have:

Rate ~ Year + Age.Group + Ethnicity + Gender

Here, Rate is the response, and we have chosen to include four independent variables (year/age group/ethnicity/gender). For models that can include an intercept (that’s more or less all regression models), a formula like this will implicitly include that intercept. If we passed it into a linear regression model, it would mean:

Rate = β₀ + β₁Year + β₂Age.Group + β₃Ethnicity + β₄Gender + ε

where each β is a constant to be determined by the model, and ε is a normally distributed error term.

If we didn’t want the intercept term, we could add a zero to the righthand side to suppress it:

Rate ~ 0 + Year + Age.Group + Ethnicity + Gender

The updated equation given by this formula is:

Rate = β₁Year + β₂Age.Group + β₃Ethnicity + β₄Gender + ε

Each of these formulae includes only the individual independent variables, without any interactions between them. To include interactions, we can replace the plusses with asterisks:

Rate ~ Year * Age.Group * Ethnicity * Gender

This adds in all possible two-way interactions (year and age group, year and ethnicity, etc.), and three-way interactions (year and age group and ethnicity, etc.), all the way up to an every-way interaction between all independent variables (year and age group and ethnicity and gender). This is very often overkill, since an every-way interaction is usually meaningless.

There are two ways of restricting the level of interaction. First, you can add individual interactions using colons. In the following example, Year:Ethnicity is a two-way interaction between those terms, and Year:Ethnicity:Gender is a three-way interaction:

Rate ~ Year + Ethnicity + Gender + Year:Ethnicity + Year:Ethnicity:Gender

This fine-grained approach can become cumbersome to type if you have more than three or four variables, though, so an alternate syntax lets you include all interactions up to a certain level using carets (^). The next example includes the year, ethnicity, and gender, and the three two-way interactions:

Rate ~ (Year + Ethnicity + Gender) ^ 2
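
If you ever want to check exactly which terms a formula like this expands to, the terms function will list them (a quick sketch, not from the book):

f <- Rate ~ (Year + Ethnicity + Gender) ^ 2
attr(terms(f), "term.labels")  #the three main effects plus the three two-way interactions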

You can also include modified versions of variables. For example, a surprising number of environmental processes generate lognormally distributed variables, which you may want to include in a linear regression as log(var). Terms like this can be included directly, but you may have spotted a problem with including var ^ 2. That syntax is reserved for interactions, so if you want to include powers of things, wrap them in I():

Rate ~ I(Year ^ 2)  #year squared, not an interaction

A First Model: Linear Regressions

Ordinary least squares linear regressions are the simplest in an extensive family of regression models. The function for calculating them is concisely if not clearly named: lm, short for “linear model.” It accepts a formula of the type we’ve just discussed, and a data frame that contains the variables to model. Let’s take a look at the gonorrhoea dataset. For simplicity, we’ll ignore interactions:

model1 <- lm(Rate ~ Year + Age.Group + Ethnicity + Gender, gonorrhoea)

If we print the model variable, it lists the coefficients for each input variable (the β values). If you look closely, you’ll notice that for each of the categorical variables that we put into the model (age group, ethnicity, and gender), one category has no coefficient. For example, the 0 to 4 age group is missing, and the American Indians & Alaskan Natives ethnicity is also missing.

These “missing” categories are included in the intercept. In the following output, the intercept value of 5,540 people infected with gonorrhoea per 100,000 people applies to 0- to 4-year-old female American Indians and Alaskan Natives in the year 0. To predict infection rates in 2013, we would add 2013 times the coefficient for Year, -2.77. To predict the effect on 25- to 29-year-olds of the same ethnicity and gender, we would also add the coefficient for that age group, 291 (a worked example follows the output):

model1
##
## Call:
## lm(formula = Rate ~ Year + Age.Group + Ethnicity + Gender, data = gonorrhoea)
##
## Coefficients:
##                         (Intercept)                                 Year
##                            5540.496                               -2.770
##                     Age.Group5 to 9                    Age.Group10 to 14
##                              -0.614                               15.268
##                   Age.Group15 to 19                    Age.Group20 to 24
##                             415.698                              546.820
##                   Age.Group25 to 29                    Age.Group30 to 34
##                             291.098                              155.872
##                   Age.Group35 to 39                    Age.Group40 to 44
##                              84.612                               49.506
##                   Age.Group45 to 54                    Age.Group55 to 64
##                              27.364                                8.684
##                 Age.Group65 or more  EthnicityAsians & Pacific Islanders
##                               1.178                              -82.923
##                  EthnicityHispanics         EthnicityNon-Hispanic Blacks
##                             -49.000                              376.204
##        EthnicityNon-Hispanic Whites                           GenderMale
##                             -68.263                              -17.892
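
To make that arithmetic concrete, here is a hand-rolled prediction built from the coefficients above (a sketch using the standard coef accessor; the result is approximate because the printed coefficients are rounded):

coefs <- coef(model1)
#rate for 25- to 29-year-old female American Indians and Alaskan
#Natives in 2013: intercept + 2013 * Year coefficient + age-group coefficient
coefs["(Intercept)"] + 2013 * coefs["Year"] + coefs["Age.Group25 to 29"]  #roughly 256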

The “0- to 4-year-old female American Indians and Alaskan Natives” group was chosen because it consists of the first level of each of the factor variables. We can see those factor levels by looping over the dataset and calling levels:

lapply(Filter(is.factor, gonorrhoea), levels)
## $Age.Group
##  [1] "0 to 4"     "5 to 9"     "10 to 14"   "15 to 19"   "20 to 24"
##  [6] "25 to 29"   "30 to 34"   "35 to 39"   "40 to 44"   "45 to 54"
## [11] "55 to 64"   "65 or more"
##
## $Ethnicity
## [1] "American Indians & Alaskan Natives"
## [2] "Asians & Pacific Islanders"
## [3] "Hispanics"
## [4] "Non-Hispanic Blacks"
## [5] "Non-Hispanic Whites"
##
## $Gender
## [1] "Female" "Male"

As well as knowing the size of the effect of each input variable, we usually want to know which variables were significant. The summary function is overloaded to work with lm to do just that. The most exciting bit of the output from summary is the coefficients table. The Estimate column shows the coefficients that we’ve seen already, and the fourth column, Pr(>|t|), shows the p-values. The fifth column gives a star rating: where the p-value is less than 0.05 a variable gets one star, less than 0.01 is two stars, and so on. This makes it easy to quickly see which variables had a significant effect:

summary(model1)
##
## Call:
## lm(formula = Rate ~ Year + Age.Group + Ethnicity + Gender, data = gonorrhoea)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -376.7 -130.6   37.1   90.7 1467.1
##
## Coefficients:
##                                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          5540.496  14866.406    0.37   0.7095
## Year                                   -2.770      7.400   -0.37   0.7083
## Age.Group5 to 9                        -0.614     51.268   -0.01   0.9904
## Age.Group10 to 14                      15.268     51.268    0.30   0.7660
## Age.Group15 to 19                     415.698     51.268    8.11  3.0e-15
## Age.Group20 to 24                     546.820     51.268   10.67  < 2e-16
## Age.Group25 to 29                     291.098     51.268    5.68  2.2e-08
## Age.Group30 to 34                     155.872     51.268    3.04   0.0025
## Age.Group35 to 39                      84.612     51.268    1.65   0.0994
## Age.Group40 to 44                      49.506     51.268    0.97   0.3346
## Age.Group45 to 54                      27.364     51.268    0.53   0.5937
## Age.Group55 to 64                       8.684     51.268    0.17   0.8656
## Age.Group65 or more                     1.178     51.268    0.02   0.9817
## EthnicityAsians & Pacific Islanders   -82.923     33.093   -2.51   0.0125
## EthnicityHispanics                    -49.000     33.093   -1.48   0.1392
## EthnicityNon-Hispanic Blacks          376.204     33.093   11.37  < 2e-16
## EthnicityNon-Hispanic Whites          -68.263     33.093   -2.06   0.0396
## GenderMale                            -17.892     20.930   -0.85   0.3930
##
## (Intercept)
## Year
## Age.Group5 to 9
## Age.Group10 to 14
## Age.Group15 to 19                   ***
## Age.Group20 to 24                   ***
## Age.Group25 to 29                   ***
## Age.Group30 to 34                   **
## Age.Group35 to 39                   .
## Age.Group40 to 44
## Age.Group45 to 54
## Age.Group55 to 64
## Age.Group65 or more
## EthnicityAsians & Pacific Islanders *
## EthnicityHispanics
## EthnicityNon-Hispanic Blacks        ***
## EthnicityNon-Hispanic Whites        *
## GenderMale
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256 on 582 degrees of freedom
## Multiple R-squared:  0.491,  Adjusted R-squared:  0.476
## F-statistic: 33.1 on 17 and 582 DF,  p-value: <2e-16

Comparing and Updating Models

Rather than just accepting the first model that we think of, we often want to find a “best” model, or a small set of models that provide some insight.

This section demonstrates some metrics for measuring the quality of a model, such as p-values and log-likelihood measures. By using these metrics to compare models, you can automatically keep updating your model until you get to a “best” one.

Unfortunately, automatic updating of models like this (“stepwise regression”) is a poor method for choosing a model, and it gets worse as you increase the number of input variables.

Better methods for model selection, like model training or model averaging, are beyond the scope of this book. The CrossValidated statistics Q&A site has a good list of possibilities.

Tip

It is often a mistake to try to find a single “best” model. A good model is one that gives you insight into your problem—there may be several that do this.

In order to sensibly choose a model for our dataset, we need to understand a little about the things that affect gonorrhoea infection rates. Gonorrhoea is primarily sexually transmitted (with some transmission from mothers to babies during childbirth), so the big drivers are related to sexual culture: how much unprotected sex people have, and with how many partners.

The p-value for Year is 0.71, which is not even close to significant. Over such a short time series (five years of data), I’d be surprised if there were any changes in sexual culture big enough to have an important effect on infection rates, so let’s see what happens when we remove it.

Rather than having to completely respecify the model, we can update it using the update function. This accepts a model and a formula. We are just updating the righthand side of the formula, so the lefthand side stays blank. In the next example, . (dot) means “the terms that were already in the formula,” and - (minus) means “remove this next term”:

model2 <- update(model1, ~ . - Year)
summary(model2)
##
## Call:
## lm(formula = Rate ~ Age.Group + Ethnicity + Gender, data = gonorrhoea)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -377.6 -128.4   34.6   92.2 1472.6
##
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          -25.103     43.116   -0.58   0.5606
## Age.Group5 to 9                       -0.614     51.230   -0.01   0.9904
## Age.Group10 to 14                     15.268     51.230    0.30   0.7658
## Age.Group15 to 19                    415.698     51.230    8.11  2.9e-15
## Age.Group20 to 24                    546.820     51.230   10.67  < 2e-16
## Age.Group25 to 29                    291.098     51.230    5.68  2.1e-08
## Age.Group30 to 34                    155.872     51.230    3.04   0.0025
## Age.Group35 to 39                     84.612     51.230    1.65   0.0992
## Age.Group40 to 44                     49.506     51.230    0.97   0.3343
## Age.Group45 to 54                     27.364     51.230    0.53   0.5934
## Age.Group55 to 64                      8.684     51.230    0.17   0.8655
## Age.Group65 or more                    1.178     51.230    0.02   0.9817
## EthnicityAsians & Pacific Islanders  -82.923     33.069   -2.51   0.0124
## EthnicityHispanics                   -49.000     33.069   -1.48   0.1389
## EthnicityNon-Hispanic Blacks         376.204     33.069   11.38  < 2e-16
## EthnicityNon-Hispanic Whites         -68.263     33.069   -2.06   0.0394
## GenderMale                           -17.892     20.915   -0.86   0.3926
##
## (Intercept)
## Age.Group5 to 9
## Age.Group10 to 14
## Age.Group15 to 19                   ***
## Age.Group20 to 24                   ***
## Age.Group25 to 29                   ***
## Age.Group30 to 34                   **
## Age.Group35 to 39                   .
## Age.Group40 to 44
## Age.Group45 to 54
## Age.Group55 to 64
## Age.Group65 or more
## EthnicityAsians & Pacific Islanders *
## EthnicityHispanics
## EthnicityNon-Hispanic Blacks        ***
## EthnicityNon-Hispanic Whites        *
## GenderMale
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256 on 583 degrees of freedom
## Multiple R-squared:  0.491,  Adjusted R-squared:  0.477
## F-statistic: 35.2 on 16 and 583 DF,  p-value: <2e-16

The anova function computes ANalysis Of VAriance tables for models, letting you see if your simplified model is significantly different from the fuller model:

anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: Rate ~ Year + Age.Group + Ethnicity + Gender
## Model 2: Rate ~ Age.Group + Ethnicity + Gender
##   Res.Df      RSS Df Sum of Sq    F Pr(>F)
## 1    582 38243062
## 2    583 38252272 -1     -9210 0.14   0.71

The p-value in the righthand column is 0.71, so removing the year term didn’t significantly affect the model’s fit to the data.

The Akaike and Bayesian information criteria provide alternate methods of comparing models, via the AIC and BIC functions. They use log-likelihood values, which tell you how well the models fit the data, and penalize them depending upon how many terms there are in the model (so simpler models are better than complex models). Smaller numbers roughly correspond to “better” models:

AIC(model1, model2)
##        df  AIC
## model1 19 8378
## model2 18 8376
BIC(model1, model2)
##        df  BIC
## model1 19 8462
## model2 18 8456
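
Under the hood, both criteria are penalized log-likelihoods; the AIC, for instance, is minus twice the log-likelihood plus twice the number of parameters. We can verify this by hand (a quick check, not book code):

ll <- logLik(model1)
-2 * as.numeric(ll) + 2 * attr(ll, "df")  #about 8378, matching AIC(model1) above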

To see the effects of these functions more clearly, let’s create a silly model by removing age group, which appears to be a powerful predictor of gonorrhoea infection rates (as it should be, since children and the elderly have much less sex than young adults):

silly_model <- update(model1, ~ . - Age.Group)
anova(model1, silly_model)
## Analysis of Variance Table
##
## Model 1: Rate ~ Year + Age.Group + Ethnicity + Gender
## Model 2: Rate ~ Year + Ethnicity + Gender
##   Res.Df      RSS  Df Sum of Sq    F Pr(>F)
## 1    582 38243062
## 2    593 57212506 -11  -1.9e+07 26.2 <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(model1, silly_model)
##             df  AIC
## model1      19 8378
## silly_model  8 8598
BIC(model1, silly_model)
##             df  BIC
## model1      19 8462
## silly_model  8 8633

In the silly model, anova notes a significant difference between the models, and both the AIC and the BIC have gone up by a lot.

Returning to our quest for a non-silly model, notice that gender is nonsignificant (p = 0.39). If you did Exercise 14-3 (you did, right?), this might surprise you, because the plot suggested that infection rates were higher in women. Hold that thought until the exercises at the end of this chapter. For now, let’s trust the p-value and remove the gender term from the model:

model3 <- update(model2, ~ . - Gender)
summary(model3)
##
## Call:
## lm(formula = Rate ~ Age.Group + Ethnicity, data = gonorrhoea)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -380.1 -136.1   35.8   87.4 1481.5
##
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          -34.050     41.820   -0.81   0.4159
## Age.Group5 to 9                       -0.614     51.218   -0.01   0.9904
## Age.Group10 to 14                     15.268     51.218    0.30   0.7657
## Age.Group15 to 19                    415.698     51.218    8.12  2.9e-15
## Age.Group20 to 24                    546.820     51.218   10.68  < 2e-16
## Age.Group25 to 29                    291.098     51.218    5.68  2.1e-08
## Age.Group30 to 34                    155.872     51.218    3.04   0.0024
## Age.Group35 to 39                     84.612     51.218    1.65   0.0991
## Age.Group40 to 44                     49.506     51.218    0.97   0.3342
## Age.Group45 to 54                     27.364     51.218    0.53   0.5934
## Age.Group55 to 64                      8.684     51.218    0.17   0.8654
## Age.Group65 or more                    1.178     51.218    0.02   0.9817
## EthnicityAsians & Pacific Islanders  -82.923     33.061   -2.51   0.0124
## EthnicityHispanics                   -49.000     33.061   -1.48   0.1389
## EthnicityNon-Hispanic Blacks         376.204     33.061   11.38  < 2e-16
## EthnicityNon-Hispanic Whites         -68.263     33.061   -2.06   0.0394
##
## (Intercept)
## Age.Group5 to 9
## Age.Group10 to 14
## Age.Group15 to 19                   ***
## Age.Group20 to 24                   ***
## Age.Group25 to 29                   ***
## Age.Group30 to 34                   **
## Age.Group35 to 39                   .
## Age.Group40 to 44
## Age.Group45 to 54
## Age.Group55 to 64
## Age.Group65 or more
## EthnicityAsians & Pacific Islanders *
## EthnicityHispanics
## EthnicityNon-Hispanic Blacks        ***
## EthnicityNon-Hispanic Whites        *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256 on 584 degrees of freedom
## Multiple R-squared:  0.491,  Adjusted R-squared:  0.477
## F-statistic: 37.5 on 15 and 584 DF,  p-value: <2e-16

Finally, the intercept term looks to be nonsignificant. This is because the default group is 0- to 4-year-old American Indians and Alaskan Natives, and they don’t have much gonorrhoea.

We can set a different default with the relevel function. As a non-Hispanic white person in the 30- to 34-year-old group,[59] I’m going to arbitrarily set those as the defaults. In this next example, notice that we can use the update function to update the data frame as well as the formula:

g2 <- within(
  gonorrhoea,
  {
    Age.Group <- relevel(Age.Group, "30 to 34")
    Ethnicity <- relevel(Ethnicity, "Non-Hispanic Whites")
  }
)
model4 <- update(model3, data = g2)
summary(model4)
##
## Call:
## lm(formula = Rate ~ Age.Group + Ethnicity, data = g2)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -380.1 -136.1   35.8   87.4 1481.5
##
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                     53.6       41.8    1.28
## Age.Group0 to 4                               -155.9       51.2   -3.04
## Age.Group5 to 9                               -156.5       51.2   -3.06
## Age.Group10 to 14                             -140.6       51.2   -2.75
## Age.Group15 to 19                              259.8       51.2    5.07
## Age.Group20 to 24                              390.9       51.2    7.63
## Age.Group25 to 29                              135.2       51.2    2.64
## Age.Group35 to 39                              -71.3       51.2   -1.39
## Age.Group40 to 44                             -106.4       51.2   -2.08
## Age.Group45 to 54                             -128.5       51.2   -2.51
## Age.Group55 to 64                             -147.2       51.2   -2.87
## Age.Group65 or more                           -154.7       51.2   -3.02
## EthnicityAmerican Indians & Alaskan Natives     68.3       33.1    2.06
## EthnicityAsians & Pacific Islanders            -14.7       33.1   -0.44
## EthnicityHispanics                              19.3       33.1    0.58
## EthnicityNon-Hispanic Blacks                   444.5       33.1   13.44
##                                             Pr(>|t|)
## (Intercept)                                   0.2008
## Age.Group0 to 4                               0.0024 **
## Age.Group5 to 9                               0.0024 **
## Age.Group10 to 14                             0.0062 **
## Age.Group15 to 19                            5.3e-07 ***
## Age.Group20 to 24                            9.4e-14 ***
## Age.Group25 to 29                             0.0085 **
## Age.Group35 to 39                             0.1647
## Age.Group40 to 44                             0.0383 *
## Age.Group45 to 54                             0.0124 *
## Age.Group55 to 64                             0.0042 **
## Age.Group65 or more                           0.0026 **
## EthnicityAmerican Indians & Alaskan Natives   0.0394 *
## EthnicityAsians & Pacific Islanders           0.6576
## EthnicityHispanics                            0.5603
## EthnicityNon-Hispanic Blacks                 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256 on 584 degrees of freedom
## Multiple R-squared:  0.491,  Adjusted R-squared:  0.477
## F-statistic: 37.5 on 15 and 584 DF,  p-value: <2e-16

The coefficients and p-values have changed because the reference point is now different, but there are still a lot of stars on the righthand side of the summary output, so we know that age and ethnicity have an impact on infection rates.

Plotting and Inspecting Models

lm models have a plot method that lets you check the goodness of fit in six different ways. In its simplest form, you can just call plot(the_model), and it draws several plots one after another. A slightly better approach is to use the layout function to see all the plots together, as demonstrated in Figure 15-2:

plot_numbers <- 1:6
layout(matrix(plot_numbers, ncol = 2, byrow = TRUE))
plot(model4, plot_numbers)

Figure 15-2. Diagnostic plots for a linear model

At the top end, there are some values that are too high (notice the large positive residuals at the righthand side of the “Residuals vs Fitted” plot, the points far above the line in the “Normal Q-Q” plot, and the spikes in the “Cook’s distance” plot). In particular, rows 40, 41, and 160 have been singled out as outliers:

gonorrhoea[c(40, 41, 160), ]
##     Year Age.Group           Ethnicity Gender Rate
## 40  2007  15 to 19 Non-Hispanic Blacks Female 2239
## 41  2007  20 to 24 Non-Hispanic Blacks Female 2200
## 160 2008  15 to 19 Non-Hispanic Blacks Female 2233

These large values all refer to non-Hispanic black females, suggesting that we are perhaps missing an interaction term with ethnicity and gender.
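
One way to chase that hunch (a speculative sketch, not pursued in the text) is to put Gender back into the model along with an Ethnicity:Gender interaction, and test whether the fit improves:

model5 <- update(model4, ~ . + Ethnicity * Gender)
anova(model4, model5)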

The model variable that is returned by lm is a fairly complex beast. The output isn’t included here for brevity, but you can explore the structure of these model variables in the usual way:

str(model4)
unclass(model4)

There are many convenience functions for accessing the various components of the model, such as formula, nobs, residuals, fitted, and coefficients:

formula(model4)
## Rate ~ Age.Group + Ethnicity
## <environment: 0x000000004ed4e110>
nobs(model4)
## [1] 600
head(residuals(model4))
##       1       2       3       4       5       6
##  102.61  102.93   87.25 -282.38 -367.61 -125.38
head(fitted(model4))
##       1       2       3       4       5       6
## -102.31 -102.93  -87.05  313.38  444.51  188.78
head(coefficients(model4))
##       (Intercept)   Age.Group0 to 4   Age.Group5 to 9 Age.Group10 to 14
##             53.56           -155.87           -156.49           -140.60
## Age.Group15 to 19 Age.Group20 to 24
##            259.83            390.95

Beyond these, there are more functions for diagnosing the quality of linear regression models (listed on the ?influence.measures page), and you can access the R^2 value (“the fraction of variance explained by the model”) from the summary:

head(cooks.distance(model4))
##         1         2         3         4         5         6
## 0.0002824 0.0002842 0.0002042 0.0021390 0.0036250 0.0004217
summary(model4)$r.squared
## [1] 0.4906

These utility functions are great for providing alternative diagnostics. For example, if you don’t want a base graphics plot of a model, you could roll your own ggplot2 version. Figure 15-3 shows an example plot of residuals versus fitted values:

diagnostics <- data.frame(
  residuals = residuals(model4),
  fitted    = fitted(model4)
)
ggplot(diagnostics, aes(fitted, residuals)) +
  geom_point() +
  geom_smooth(method = "loess")

Figure 15-3. A ggplot2-based diagnostic plot of residuals vs. fitted values

The real beauty of using models is that you can predict outcomes based on whatever model inputs interest you. It’s more fun when you have a time variable in the model so that you can predict the future, but in this case we can take a look at infection rates for specific demographics. In a completely self-interested fashion, I’m going to look at the infection rates for 30- to 34-year-old non-Hispanic white people:

new_data <- data.frame(
  Age.Group = "30 to 34",
  Ethnicity = "Non-Hispanic Whites"
)
predict(model4, new_data)
##     1
## 53.56

The model predicts an infection rate of 54 people per 100,000. Let’s compare it to the data for that group:

subset(
  gonorrhoea,
  Age.Group == "30 to 34" & Ethnicity == "Non-Hispanic Whites"
)
##     Year Age.Group           Ethnicity Gender Rate
## 7   2007  30 to 34 Non-Hispanic Whites   Male 41.0
## 19  2007  30 to 34 Non-Hispanic Whites Female 45.6
## 127 2008  30 to 34 Non-Hispanic Whites   Male 35.1
## 139 2008  30 to 34 Non-Hispanic Whites Female 40.8
## 247 2009  30 to 34 Non-Hispanic Whites   Male 34.6
## 259 2009  30 to 34 Non-Hispanic Whites Female 33.8
## 367 2010  30 to 34 Non-Hispanic Whites   Male 40.8
## 379 2010  30 to 34 Non-Hispanic Whites Female 38.5
## 487 2011  30 to 34 Non-Hispanic Whites   Male 48.0
## 499 2011  30 to 34 Non-Hispanic Whites Female 40.7

The data points range from 34 to 48, so the prediction is a little bit high.
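
If you want to see how much uncertainty surrounds that prediction, the predict method for lm models can also return confidence or prediction intervals (output not shown):

predict(model4, new_data, interval = "prediction")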

Other Model Types

Linear regression is barely a hairline scratch on the surface of R’s modeling capabilities. As R is now the tool of choice in many university statistics departments, more or less every model ever conceived is available in R or one of its packages. There are many books that focus on the statistical side of using R, so this section only provides brief pointers on where to look for further information.

Tip

The modeling capabilities of R are spread across many packages written by many people, and consequently their syntax varies. The caret package provides wrappers to about 150 models to give them a consistent interface, along with some tools for model training and validation. Max Kuhn’s Applied Predictive Modeling is the definitive reference to this package.

The lm function for linear (ordinary least squares) regression models has a generalization, glm, that lets you specify different distributions for error terms and transformations on the response variable. You can use it for logistic regression (where the response variable is logical or categorical), amongst other things, and the syntax is almost identical to that of lm:

glm(true_or_false ~ some + predictor + variables, data, family = binomial())
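
As a runnable illustration (a sketch using the built-in mtcars dataset, not the book’s data), here is a logistic regression of transmission type on car weight:

glm(am ~ wt, data = mtcars, family = binomial())  #am is 0 for automatic, 1 for manual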

John Fox’s An R Companion to Applied Regression is basically 472 pages of cool things you can do with the glm function.

The nlme package (which ships with R) contains the lme function for linear mixed-effects models and nlme for nonlinear mixed-effects models.[60] Again, the syntax is more or less the same as with lm:

lme(y ~ some + fixed + effects, data, random = ~ 1 | random / effects)
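
For a concrete run (a sketch using the Orthodont dataset that ships with nlme), here is a model with a fixed effect of age and a random intercept for each subject:

library(nlme)
lme(distance ~ age, data = Orthodont, random = ~ 1 | Subject)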

Mixed-Effects Models in S and S-PLUS by José Pinheiro and Doug Bates is the canonical reference, with loads of examples.

For response variables that are proportions, the betareg package contains a function of the same name that allows beta regression. This is especially important for Exercise 15-3.

For data miners (and others with high-dimensional datasets), the rpart package, which also ships with R, creates regression trees (a.k.a. decision trees). Even more excitingly, the randomForest package lets you create a whole forest of regression trees, and the C50 and mboost packages provide boosted models, which many regard as even better than random forests.
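
As a quick taste (a sketch on the built-in iris data rather than anything from this chapter), growing a tree takes one line:

library(rpart)
rpart(Species ~ ., data = iris)  #a classification tree predicting species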

kmeans lets you do k-means clustering, and there are several packages that provide specialist extensions, such as kernlab for weighted k-means, kml for longitudinal k-means, trimcluster for trimmed k-means, and skmeans for spherical k-means.
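
A minimal sketch of the base function, clustering the iris measurements into three groups and cross-tabulating the clusters against the true species:

clusters <- kmeans(iris[, 1:4], centers = 3)
table(clusters$cluster, iris$Species)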

If you do a lot of data mining, you may be interested in Rattle, a GUI that gives easy access to R’s data-mining models. Graham Williams has written a book about it, Data Mining with Rattle and R, but start by visiting the website to see if it takes your fancy.

Social scientists may appreciate traditional dimension-reduction models: factor analysis is supported through the factanal function, and principal components analysis has two functions (princomp provides S-PLUS compatibility, and prcomp uses a more modern, numerically stable algorithm).
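
For example (a sketch using the built-in USArrests data), prcomp is a one-liner, with scaling so that each variable contributes equally:

pca <- prcomp(USArrests, scale. = TRUE)
summary(pca)  #proportion of variance explained by each component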

The deSolve package contains many methods for solving systems of ordinary/partial/delay differential equations.

Summary

  • You can generate random numbers from pretty much any distribution ever conceived.
  • Most distributions also have functions available for calculating their PDF/CDF/inverse CDF.
  • Many modeling functions use formulae to specify the form of the model.
  • lm runs a linear regression, and there are many support functions for updating, diagnosing, and predicting with these models.
  • R can run a wide range of statistical models.

Test Your Knowledge: Quiz

Question 15-1
How would you create the same set of random numbers twice?
Question 15-2
What is the naming convention for PDF, CDF, and inverse CDF functions?
Question 15-3
What does a colon (:) mean in the context of model formulae?
Question 15-4
Which functions can you use to compare linear regression models?
Question 15-5
How do you determine the fraction of variance explained by a linear model?

Test Your Knowledge: Exercises

Exercise 15-1
  1. While typing this book, I make about three typos per paragraph. Use the PDF of the Poisson distribution, dpois, to find the probability that I make exactly three typos in a given paragraph. [5]
  2. A healthy 25-year-old woman having unprotected sex at the right time has a 25% chance of getting pregnant each month. Use the CDF for the negative binomial distribution, pnbinom, to calculate the probability that she will have become pregnant after a year. [5]
  3. You need 23 people to have a 50% chance that two or more of them share the same birthday. Use the inverse CDF of the birthday distribution, qbirthday, to calculate how many people you need to have a 90% chance of a shared birthday. [5]
Exercise 15-2

Re-run the linear regression analysis on the gonorrhoea dataset, considering only 15- to 34-year-olds. Are the significant predictor variables different this time? [15]

For bonus points, explore adding interaction terms into the model. [15]

Exercise 15-3

Install and load the betareg package. Explore the obama_vs_mccain dataset using beta regression, via the betareg function in that package. Use the Obama column as the response variable.

To keep things simple, remove the “District of Columbia” outlier, don’t bother with interactions, and only include one ethnicity and one religion column. (The ethnicity and religion columns aren’t independent because they represent fractions of a total.) In the absence of political understanding, you may trust the p-values for the purposes of updating models. Hint: You need to rescale the Obama column to range from 0 to 1. [30]



[58] If you’ve just started casually doodling a biplot, then well done, have some geek points. Plots of lots of variables are possible if you use a dimension-reduction technique like principal component analysis or factor analysis to give you fewer variables in practice.

[59] Please remind me to change this for the second edition!

[60] Mixed-effects models are regressions where some of your predictor variables affect the variance of the response rather than the mean. For example, if you measure oxygen levels in people’s bloodstreams, you might want to know how much variation there is between people and between measurements for a person, but you don’t care that oxygen levels are higher or lower for a specific person.
