Chapter 15. Advanced methods for missing data

 

This chapter covers

  • Identification of missing data
  • Visualization of missing data patterns
  • Complete-case analysis
  • Multiple imputation of missing data

 

In previous chapters, we focused on the analysis of complete datasets (that is, datasets without missing values). Although doing so has helped simplify the presentation of statistical and graphical methods, in the real world, missing data are ubiquitous.

In some ways, the impact of missing data is a subject that most of us want to avoid. Statistics books may not mention it or may limit discussion to a few paragraphs. Statistical packages offer automatic handling of missing data using methods that may not be optimal. Even though most data analyses (at least in social sciences) involve missing data, this topic is rarely mentioned in the methods and results sections of journal articles. Given how often missing values occur, and the degree to which their presence can invalidate study results, it’s fair to say that the subject has received insufficient attention outside of specialized books and courses.

Data can be missing for many reasons. Survey participants may forget to answer one or more questions, refuse to answer sensitive questions, or grow fatigued and fail to complete a long questionnaire. Study participants may miss appointments or drop out of a study prematurely. Recording equipment may fail, internet connections may be lost, and data may be miscoded. The presence of missing data may even be planned. For example, to increase study efficiency or reduce costs, you may choose not to collect all data from all participants. Finally, data may be lost for reasons that you’re never able to ascertain.

Unfortunately, most statistical methods assume that you’re working with complete matrices, vectors, and data frames. In most cases, you have to eliminate missing data before you address the substantive questions that led you to collect the data. You can eliminate missing data by (1) removing cases with missing data, or (2) replacing missing data with reasonable substitute values. In either case, the end result is a dataset without missing values.

In this chapter, we’ll look at both traditional and modern approaches for dealing with missing data. We’ll primarily use the VIM and mice packages. The command install.packages(c("VIM", "mice")) will download and install both.

To motivate the discussion, we’ll look at the mammal sleep dataset (sleep) provided in the VIM package (not to be confused with the sleep dataset describing the impact of drugs on sleep provided in the base installation). The data come from a study by Allison and Cicchetti (1976) that examined the relationship between sleep, ecological, and constitutional variables for 62 mammal species. The authors were interested in why animals’ sleep requirements vary from species to species. The sleep variables served as the dependent variables, whereas the ecological and constitutional variables served as the independent or predictor variables.

Sleep variables included length of dreaming sleep (Dream), nondreaming sleep (NonD), and their sum (Sleep). The constitutional variables included body weight in kilograms (BodyWgt), brain weight in grams (BrainWgt), life span in years (Span), and gestation time in days (Gest). The ecological variables included degree to which species were preyed upon (Pred), degree of their exposure while sleeping (Exp), and overall danger (Danger) faced. The ecological variables were measured on 5-point rating scales that ranged from 1 (low) to 5 (high).

In their original article, Allison and Cicchetti limited their analyses to the species that had complete data. We’ll go further, analyzing all 62 cases using a multiple imputation approach.

15.1. Steps in dealing with missing data

Readers new to the study of missing data will find a bewildering array of approaches, critiques, and methodologies. The classic text in this area is Little and Rubin (2002). Excellent, accessible reviews can be found in Allison (2001), Schafer and Graham (2002), and Schlomer, Bauman, and Card (2010). A comprehensive approach will usually include the following steps:

1.  Identify the missing data.

2.  Examine the causes of the missing data.

3.  Delete the cases containing missing data or replace (impute) the missing values with reasonable alternative data values.

Unfortunately, identifying missing data is usually the only unambiguous step. Learning why data are missing depends on your understanding of the processes that generated the data. Deciding how to treat missing values will depend on your estimation of which procedures will produce the most reliable and accurate results.

 

A classification system for missing data

Statisticians typically classify missing data into one of three types. These types are usually described in probabilistic terms, but the underlying ideas are straightforward. We’ll use the measurement of dreaming in the sleep study (where 12 animals have missing values) to illustrate each type in turn.

(1) Missing completely at random—If the presence of missing data on a variable is unrelated to any other observed or unobserved variable, then the data are missing completely at random (MCAR). If there’s no systematic reason why dream sleep is missing for these 12 animals, the data is said to be MCAR. Note that if every variable with missing data is MCAR, you can consider the complete cases to be a simple random sample from the larger dataset.

(2) Missing at random—If the presence of missing data on a variable is related to other observed variables but not to its own unobserved value, the data is missing at random (MAR). For example, if animals with smaller body weights are more likely to have missing values for dream sleep (perhaps because it’s harder to observe smaller animals), and the “missingness” is unrelated to an animal’s time spent dreaming, the data would be considered MAR. In this case, the presence or absence of dream sleep data would be random, once you controlled for body weight.

(3) Not missing at random—If the missing data for a variable is neither MCAR nor MAR, it is not missing at random (NMAR). For example, if animals that spend less time dreaming are also more likely to have a missing dream value (perhaps because it’s harder to measure shorter events), the data would be considered NMAR.

Most approaches to missing data assume that the data is either MCAR or MAR. In this case, you can ignore the mechanism producing the missing data and (after replacing or deleting the missing data) model the relationships of interest directly. Data that’s NMAR can be difficult to analyze properly. When data is NMAR, you have to model the mechanisms that produced the missing values, as well as the relationships of interest. (Current approaches to analyzing NMAR data include the use of selection models and pattern mixtures. The analysis of NMAR data can be quite complex and is beyond the scope of this book.)
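The three mechanisms can be made concrete with a small simulation. The following is a minimal sketch on made-up data, not the sleep dataset: the variable names, sample size, and coefficients are all assumptions chosen for illustration. It deletes values of a simulated dream variable under each mechanism and compares what remains with the truth.

```r
# Simulated illustration of MCAR, MAR, and NMAR (all values are hypothetical)
set.seed(1234)
n       <- 5000
bodywgt <- rnorm(n)                          # always observed
dream   <- 2 - 0.5 * bodywgt + rnorm(n)      # true values, before any deletion

# Probability that dream is missing under each mechanism
p_mcar <- rep(0.2, n)                        # unrelated to any variable
p_mar  <- plogis(-1.5 - 1.5 * bodywgt)       # depends on observed bodywgt only
p_nmar <- plogis(-1.5 - 1.5 * (dream - 2))   # depends on the value of dream itself

make_missing <- function(x, p) ifelse(runif(length(x)) < p, NA, x)
dream_mcar <- make_missing(dream, p_mcar)
dream_mar  <- make_missing(dream, p_mar)
dream_nmar <- make_missing(dream, p_nmar)

# Mean of the observed values under each mechanism versus the true mean
means <- c(true = mean(dream),
           MCAR = mean(dream_mcar, na.rm = TRUE),
           MAR  = mean(dream_mar,  na.rm = TRUE),
           NMAR = mean(dream_nmar, na.rm = TRUE))
round(means, 2)

# Under MAR, controlling for bodywgt recovers the true slope of -0.5
slope_mar <- coef(lm(dream_mar ~ bodywgt))["bodywgt"]
slope_mar
```

In this setup, the observed mean under MCAR stays close to the true mean, whereas the MAR and NMAR means drift away from it. The regression slope under MAR remains close to the value used to generate the data, which is what "random, once you controlled for body weight" means in practice.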

 

There are many methods for dealing with missing data—and no guarantee that they’ll produce the same results. Figure 15.1 describes an array of methods used for handling incomplete data and the R packages that support them.

Figure 15.1. Methods for handling incomplete data, along with the R packages that support them

A complete review of missing data methodologies would require a book in itself. In this chapter, we’ll review methods for exploring missing values patterns and focus on the three most popular methods for dealing with incomplete data (a rational approach, listwise deletion, and multiple imputation). We’ll end the chapter with a brief discussion of other methods, including those that are useful in special circumstances.

15.2. Identifying missing values

To begin, let’s review the material introduced in chapter 4, section 4.5, and expand on it. R represents missing values using the symbol NA (not available) and impossible values by the symbol NaN (not a number). In addition, the symbols Inf and -Inf represent positive infinity and negative infinity, respectively. The functions is.na(), is.nan(), and is.infinite() can be used to identify missing, impossible, and infinite values respectively. Each returns either TRUE or FALSE. Examples are given in table 15.1.

Table 15.1. Examples of return values for the is.na(), is.nan(), and is.infinite() functions

x            is.na(x)   is.nan(x)   is.infinite(x)
x <- NA      TRUE       FALSE       FALSE
x <- 0 / 0   TRUE       TRUE        FALSE
x <- 1 / 0   FALSE      FALSE       TRUE

These functions return an object that’s the same size as its argument, with each element replaced by TRUE if the element is of the type being tested, and FALSE otherwise. For example, let y <- c(1, 2, 3, NA). Then is.na(y) will return the vector c(FALSE, FALSE, FALSE, TRUE).
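As a quick check of this element-wise behavior, here is a small sketch using a made-up vector z that mixes all three kinds of values. The is.finite() call at the end is an addition not discussed above; it's a base R function that is TRUE only for ordinary numbers.

```r
z <- c(1, NA, 0 / 0, 1 / 0, -1 / 0)

is.na(z)        # FALSE  TRUE  TRUE FALSE FALSE  (flags both NA and NaN)
is.nan(z)       # FALSE FALSE  TRUE FALSE FALSE  (flags NaN only)
is.infinite(z)  # FALSE FALSE FALSE  TRUE  TRUE  (flags Inf and -Inf only)

# Positions of values that are neither missing, impossible, nor infinite
which(is.finite(z))
```

Note that is.na() is TRUE for NaN as well as NA, so a count based on is.na() includes impossible values.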

The function complete.cases() can be used to identify the rows in a matrix or data frame that don’t contain missing data. It returns a logical vector with TRUE for every row that is a complete case and FALSE for every row that has one or more missing values.

Let’s apply this to the sleep dataset:

# load the dataset
data(sleep, package="VIM")

# list the rows that do not have missing values
sleep[complete.cases(sleep),]

# list the rows that have one or more missing values
sleep[!complete.cases(sleep),]

Examining the output reveals that 42 cases have complete data and 20 cases have one or more missing values.

Because the logical values TRUE and FALSE are equivalent to the numeric values 1 and 0, the sum() and mean() functions can be used to obtain useful information about missing data. Consider the following:

> sum(is.na(sleep$Dream))
[1] 12
> mean(is.na(sleep$Dream))
[1] 0.19
> mean(!complete.cases(sleep))
[1] 0.32

The results indicate that there are 12 missing values for the variable Dream. Nineteen percent of the cases have a missing value on this variable. In addition, 32 percent of the cases in the dataset contain one or more missing values.
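The same idea extends to every variable at once. The sketch below uses a small hypothetical data frame (dat, with made-up columns x1 to x3) rather than the sleep data, and relies only on base R functions.

```r
# Hypothetical data frame with a few missing values
dat <- data.frame(x1 = c(1, NA, 3, 4, NA),
                  x2 = c(NA, 2, 3, 4, 5),
                  x3 = c(1, 2, 3, 4, 5))

n_missing    <- colSums(is.na(dat))    # number of missing values per variable
prop_missing <- colMeans(is.na(dat))   # proportion missing per variable
per_case     <- rowSums(is.na(dat))    # number of missing values per case

n_missing
prop_missing
per_case
```

Replacing dat with sleep would give the per-variable counts for the mammal data.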

There are two things to keep in mind when identifying missing values. First, the complete.cases() function only identifies NA and NaN as missing. Infinite values (Inf and -Inf) are treated as valid values. Second, you must use missing values functions, like those in this section, to identify the missing values in R data objects. Logical comparisons such as myvar == NA evaluate to NA rather than TRUE, so they can’t be used to flag missing values.
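Both cautions are easy to verify. The following sketch uses a made-up vector and data frame; the last step, which recodes infinite values as NA before checking for complete cases, is one possible workaround and an assumption about what you might want, not something complete.cases() does for you.

```r
myvar <- c(3, NA, 7)
myvar == NA          # NA NA NA: the comparison itself is missing
is.na(myvar)         # FALSE  TRUE FALSE

# Hypothetical data frame containing one NA, one NaN, and one Inf
dat <- data.frame(a = c(1, NA, 3, 4),
                  b = c(2, 5, NaN, 8),
                  d = c(1, 2, 3, Inf))
complete.cases(dat)  # TRUE FALSE FALSE TRUE: the row with Inf counts as complete

# One way to treat infinite values as missing before checking
dat_clean <- dat
dat_clean[] <- lapply(dat_clean,
                      function(col) replace(col, is.infinite(col), NA))
complete.cases(dat_clean)
```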

Now that you know how to identify missing values programmatically, let’s look at tools that help you explore possible patterns in the occurrence of missing data.

15.3. Exploring missing values patterns

Before deciding how to deal with missing data, you’ll find it useful to determine which variables have missing values, in what amounts, and in what combinations. In this section, we’ll review tabular, graphical, and correlational methods for exploring missing values patterns. Ultimately, you want to understand why the data is missing. The answer will have implications for how you proceed with further analyses.

15.3.1. Tabulating missing values

You’ve already seen a rudimentary approach to identifying missing values. You can use the complete.cases() function from section 15.2 to list cases that are complete, or conversely, list cases that have one or more missing values. As the size of a dataset grows, though, it becomes a less attractive approach. In this case, you can turn to other R functions.

The md.pattern() function in the mice package will produce a tabulation of the missing data patterns in a matrix or data frame. Applying this function to the sleep dataset, you get the following:

> library(mice)
> data(sleep, package="VIM")
> md.pattern(sleep)
   BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD
42       1        1    1   1      1     1    1    1     1     1  0
 2       1        1    1   1      1     1    0    1     1     1  1
 3       1        1    1   1      1     1    1    0     1     1  1
 9       1        1    1   1      1     1    1    1     0     0  2
 2       1        1    1   1      1     0    1    1     1     0  2
 1       1        1    1   1      1     1    0    0     1     1  2
 2       1        1    1   1      1     0    1    1     0     0  3
 1       1        1    1   1      1     1    0    1     0     0  3
         0        0    0   0      0     4    4    4    12    14 38

The 1’s and 0’s in the body of the table indicate the missing values patterns, with a 0 indicating a missing value for a given column variable and a 1 indicating a nonmissing value. The first row describes the pattern of “no missing values” (all elements are 1). The second row describes the pattern “no missing values except for Span.” The first column indicates the number of cases in each missing data pattern, and the last column indicates the number of variables with missing values present in each pattern. Here you can see that there are 42 cases without missing data and 2 cases that are missing Span alone. Nine cases are missing both NonD and Dream values. The dataset contains a total of (42 × 0) + (2 × 1) + ... + (1 × 3) = 38 missing values. The last row gives the total number of missing values present on each variable.
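To see what such a table is counting, you can build a rough equivalent in base R. This is a sketch on a hypothetical six-row data frame, not a reimplementation of md.pattern(): it codes each value as 1 (observed) or 0 (missing), collapses each row into a pattern string, and counts identical patterns.

```r
# Hypothetical data frame with missing values on x1 and x2
dat <- data.frame(x1 = c(1, NA, 3, 4, NA, 6),
                  x2 = c(NA, 2, 3, 4, NA, 6),
                  x3 = c(1, 2, 3, 4, 5, 6))

# 1 = observed, 0 = missing, matching the coding described above
observed <- (!is.na(dat)) * 1

# Collapse each row's pattern into a string and count identical patterns
pattern        <- apply(observed, 1, paste, collapse = "")
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts

# Missing values per variable (the counterpart of the table's last row)
missing_per_var <- colSums(observed == 0)
missing_per_var
```

Here the pattern "111" (no missing values) occurs three times, and each of the other three patterns occurs once.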

15.3.2. Exploring missing data visually

Although the tabular output from the md.pattern() function is compact, I often find it easier to discern patterns visually. Luckily, the VIM package provides numerous functions for visualizing missing values patterns in datasets. In this section, we’ll review several, including aggr(), matrixplot(), and marginplot().

The aggr() function plots the number of missing values for each variable alone and for each combination of variables. For example, the code

library("VIM")
aggr(sleep, prop=FALSE, numbers=TRUE)

produces the graph in figure 15.2. (Some versions of the VIM package open a GUI when loaded. You can close it; we’ll be using code to accomplish the tasks in this chapter.)

Figure 15.2. aggr() produced plot of missing values patterns for the sleep dataset.

You can see that the variable NonD has the largest number of missing values (14), and that 2 mammals are missing NonD, Dream, and Sleep scores. Forty-two mammals have no missing data.

The statement aggr(sleep, prop=TRUE, numbers=TRUE) produces the same plot, but proportions are displayed instead of counts. The option numbers=FALSE (the default) suppresses the numeric labels.

The matrixplot() function produces a plot displaying the data for each case. A graph created using matrixplot(sleep) is displayed in figure 15.3. Here, the numeric data is rescaled to the interval [0, 1] and represented by grayscale colors, with lighter colors representing lower values and darker colors representing larger values. By default, missing values are represented in red. Note that in figure 15.3, red has been replaced with crosshatching by hand, so that the missing values are viewable in grayscale. It will look different when you create the graph yourself.

Figure 15.3. Matrix plot of actual and missing values by case (row) for the sleep dataset. The matrix is sorted by BodyWgt.

The graph is interactive: clicking on a column will re-sort the matrix by that variable. The rows in figure 15.3 are sorted in descending order by BodyWgt. A matrix plot allows you to see if the presence of missing values on one or more variables is related to the actual values of other variables. Here, you can see that there are no missing values on sleep variables (Dream, NonD, Sleep) for low values of body or brain weight (BodyWgt, BrainWgt).

The marginplot() function produces a scatter plot between two variables with information about missing values shown in the plot’s margins. Consider the relationship between amount of dream sleep and the length of a mammal’s gestation. The statement

marginplot(sleep[c("Gest","Dream")], pch=c(20),
           col=c("darkgray", "red", "blue"))

produces the graph in figure 15.4. The pch and col parameters are optional and provide control over the plotting symbols and colors used.

Figure 15.4. Scatter plot between amount of dream sleep and length of gestation, with information about missing data in the margins

The body of the graph displays the scatter plot between Gest and Dream (based on complete cases for the two variables). In the left margin, box plots display the distribution of Dream for mammals with (dark gray) and without (red) Gest values. Note that in grayscale, red is the darker shade. Four red dots represent the values of Dream for mammals missing Gest scores. In the bottom margin, the roles of Gest and Dream are reversed. You can see that a negative relationship exists between length of gestation and dream sleep and that dream sleep tends to be higher for mammals that are missing a gestation score. The number of observations with missing values on both variables at the same time is printed in blue at the intersection of both margins (bottom left).

The VIM package has many graphs that can help you understand the role of missing data in a dataset and is well worth exploring. There are functions to produce scatter plots, box plots, histograms, scatter plot matrices, parallel plots, rug plots, and bubble plots that incorporate information about missing values.

15.3.3. Using correlations to explore missing values

Before moving on, there’s one more approach worth noting. You can replace the data in a dataset with indicator variables, coded 1 for missing and 0 for present. The resulting matrix is sometimes called a shadow matrix. Correlating these indicator variables with each other and with the original (observed) variables can help you to see which variables tend to be missing together, as well as relationships between a variable’s “missingness” and the values of the other variables.

Consider the following code:

x <- as.data.frame(abs(is.na(sleep)))

The elements of data frame x are 1 if the corresponding element of sleep is missing and 0 otherwise. You can see this by viewing the first few rows of each:

> head(sleep, n=5)
    BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
1 6654.000   5712.0   NA    NA   3.3 38.6  645    3   5       3
2    1.000      6.6  6.3   2.0   8.3  4.5   42    3   1       3
3    3.385     44.5   NA    NA  12.5 14.0   60    1   1       1
4    0.920      5.7   NA    NA  16.5   NA   25    5   2       3
5 2547.000   4603.0  2.1   1.8   3.9 69.0  624    3   5       4

> head(x, n=5)
  BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
1       0        0    1     1     0    0    0    0   0      0
2       0        0    0     0     0    0    0    0   0      0
3       0        0    1     1     0    0    0    0   0      0
4       0        0    1     1     0    1    0    0   0      0
5       0        0    0     0     0    0    0    0   0      0

The statement

y <- x[which(sapply(x, sd) > 0)]

extracts the variables that have some (but not all) missing values, and

cor(y)

gives you the correlations among these indicator variables:

        NonD  Dream  Sleep   Span   Gest
NonD   1.000  0.907  0.486  0.015 -0.142
Dream  0.907  1.000  0.204  0.038 -0.129
Sleep  0.486  0.204  1.000 -0.069 -0.069
Span   0.015  0.038 -0.069  1.000  0.198
Gest  -0.142 -0.129 -0.069  0.198  1.000

Here, you can see that Dream and NonD tend to be missing together (r=0.91). To a lesser extent, Sleep and NonD tend to be missing together (r=0.49) and Sleep and Dream tend to be missing together (r=0.20).

Finally, you can look at the relationship between the presence of missing values in a variable and the observed values on other variables:

> cor(sleep, y, use="pairwise.complete.obs")
           NonD  Dream   Sleep   Span   Gest
BodyWgt   0.227  0.223  0.0017 -0.058 -0.054
BrainWgt  0.179  0.163  0.0079 -0.079 -0.073
NonD         NA     NA      NA -0.043 -0.046
Dream    -0.189     NA -0.1890  0.117  0.228
Sleep    -0.080 -0.080      NA  0.096  0.040
Span      0.083  0.060  0.0052     NA -0.065
Gest      0.202  0.051  0.1597 -0.175     NA
Pred      0.048 -0.068  0.2025  0.023 -0.201
Exp       0.245  0.127  0.2608 -0.193 -0.193
Danger    0.065 -0.067  0.2089 -0.067 -0.204
Warning message:
In cor(sleep, y, use = "pairwise.complete.obs") :
  the standard deviation is zero

In this correlation matrix, the rows are observed variables, and the columns are indicator variables representing missingness. You can ignore the warning message and NA values in the correlation matrix; they’re artifacts of our approach.

From the first column of the correlation matrix, you can see that nondreaming sleep scores are more likely to be missing for mammals with higher body weight (r=0.227), gestation period (r=0.202), and sleeping exposure (r=0.245). Other columns are read in a similar fashion. None of the correlations in this table are particularly large or striking, which suggests that the data deviate minimally from MCAR and may be MAR.

Note that you can never rule out the possibility that the data are NMAR because you don’t know what the actual values would have been for data that are missing. For example, you don’t know if there’s a relationship between the amount of dreaming a mammal engages in and the probability of obtaining a missing value on this variable. In the absence of strong external evidence to the contrary, we typically assume that data is either MCAR or MAR.
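The logic of correlating a missingness indicator with an observed variable can be checked on data where the mechanism is known. The sketch below simulates hypothetical variables (wgt and nond are made-up stand-ins, and the deletion rule is an assumption), makes nond more likely to be missing for heavier cases, and then looks at the relationship from two angles.

```r
set.seed(42)
n    <- 200
wgt  <- rnorm(n, mean = 50, sd = 10)    # fully observed (hypothetical)
nond <- rnorm(n, mean = 9, sd = 2)

# Make nond more likely to be missing for heavier cases
nond[runif(n) < plogis((wgt - 60) / 5)] <- NA

miss_nond <- as.numeric(is.na(nond))    # indicator: 1 = missing, 0 = present

r_miss    <- cor(wgt, miss_nond)        # correlation of weight with missingness
grp_means <- tapply(wgt, miss_nond, mean)  # mean weight when nond is present (0) or missing (1)
r_miss
grp_means
```

A positive correlation and a higher mean weight in the missing group both point to the same conclusion: missingness on nond is related to an observed variable, so the data aren't MCAR.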

15.4. Understanding the sources and impact of missing data

We identify the amount, distribution, and pattern of missing data in order to evaluate (1) the potential mechanisms producing the missing data and (2) the impact of the missing data on our ability to answer substantive questions. In particular, we want to answer the following questions:

  • What percentage of the data is missing?
  • Is it concentrated in a few variables, or widely distributed?
  • Does it appear to be random?
  • Does the covariation of missing data with each other or with observed data suggest a possible mechanism that’s producing the missing values?

Answers to these questions will help determine which statistical methods are most appropriate for analyzing your data. For example, if the missing data are concentrated in a few relatively unimportant variables, you may be able to delete these variables and continue your analyses normally. If there’s a small amount of data (say less than 10 percent) that’s randomly distributed throughout the dataset (MCAR), you may be able to limit your analyses to cases with complete data and still get reliable and valid results. If you can assume that the data are either MCAR or MAR, you may be able to apply multiple imputation methods to arrive at valid conclusions. If the data are NMAR, you can turn to specialized methods, collect new data, or go into an easier and more rewarding profession.

Here are some examples:

  • In a recent survey employing paper questionnaires, I found that several items tended to be missing together. It became apparent that these items clustered together because participants didn’t realize that the third page of the questionnaire had a reverse side containing them. In this case, the data could be considered MCAR.
  • In another study, an education variable was frequently missing in a global survey of leadership styles. Investigation revealed that European participants were more likely to leave this item blank. It turned out that the categories didn’t make sense for participants in certain countries. In this case, the data was most likely MAR.
  • Finally, I was involved in a study of depression in which older patients were more likely to omit items describing depressed mood when compared with younger patients. Interviews revealed that older patients were loath to admit to such symptoms because doing so violated their values about keeping a “stiff upper lip.” Unfortunately, it was also determined that severely depressed patients were more likely to omit these items due to a sense of hopelessness and difficulties with concentration. In this case, the data had to be considered NMAR.

As you can see, the identification of patterns is only the first step. You need to bring your understanding of the research subject matter and the data collection process to bear in order to determine the source of the missing values.

Now that we’ve considered the source and impact of missing data, let’s see how standard statistical approaches can be altered to accommodate them. We’ll focus on three approaches that are very popular: a rational approach for recovering data, a traditional approach that involves deleting missing data, and a modern approach that involves the use of simulation. Along the way, we’ll briefly look at methods for specialized situations, and methods that have become obsolete and should be retired. Our goal will remain constant: to answer, as accurately as possible, the substantive questions that led us to collect the data, given the absence of complete information.

15.5. Rational approaches for dealing with incomplete data

In a rational approach, you use mathematical or logical relationships among variables to attempt to fill in or recover the missing values. A few examples will help clarify this approach.

In the sleep dataset, the variable Sleep is the sum of the Dream and NonD variables. If you know a mammal’s scores on any two, you can derive the third. Thus, if there were some observations that were missing only one of the three variables, you could recover the missing information through addition or subtraction.
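Here is a minimal sketch of that recovery on a hypothetical data frame that borrows the sleep column names; the values are made up. A value is filled in only when exactly the other two are present.

```r
# Hypothetical cases, each missing at most one of the three sleep variables,
# except the last, which is missing two and can't be recovered
dat <- data.frame(NonD  = c(7.0, NA, 9.1, NA),
                  Dream = c(1.5, 1.5, NA, NA),
                  Sleep = c(NA, 10.0, 12.0, 8.0))

fill_sleep <- is.na(dat$Sleep) & !is.na(dat$NonD)  & !is.na(dat$Dream)
fill_nond  <- is.na(dat$NonD)  & !is.na(dat$Sleep) & !is.na(dat$Dream)
fill_dream <- is.na(dat$Dream) & !is.na(dat$Sleep) & !is.na(dat$NonD)

recovered <- dat
recovered$Sleep[fill_sleep] <- dat$NonD[fill_sleep]  + dat$Dream[fill_sleep]
recovered$NonD[fill_nond]   <- dat$Sleep[fill_nond]  - dat$Dream[fill_nond]
recovered$Dream[fill_dream] <- dat$Sleep[fill_dream] - dat$NonD[fill_dream]

recovered
```

The first three rows become complete. The fourth row is left untouched, because one equation can't recover two unknowns.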

As a second example, consider research that focuses on work/life balance differences between generational cohorts (for example, Silents, Early Boomers, Late Boomers, Xers, Millennials), where cohorts are defined by their birth year. Participants are asked both their date of birth and their age. If date of birth is missing, you can recover their birth year (and therefore their generational cohort) by knowing their age and the date they completed the survey.

An example that uses logical relationships to recover missing data comes from a set of leadership studies in which participants were asked if they were a manager (yes/no) and the number of their direct reports (integer). If they left the manager question blank but indicated that they had one or more direct reports, it would be reasonable to infer that they were a manager.

As a final example, I frequently engage in gender research that compares the leadership styles and effectiveness of men and women. Participants complete surveys that include their name (first and last), gender, and a detailed assessment of their leadership approach and impact. If participants leave the gender question blank, I have to impute the value in order to include them in the research. In one recent study of 66,000 managers, 11,000 (17 percent) had a missing value for gender.

To remedy the situation, I employed the following rational process. First, I cross-tabulated first name with gender. Some first names were associated with males, some with females, and some with both. For example, “William” appeared 417 times and was always a male. Conversely, the name “Chris” appeared 237 times but was sometimes a male (86 percent) and sometimes a female (14 percent). If a first name appeared more than 20 times in the dataset and was always associated with males or with females (but never both), I assumed that the name represented a single gender. I used this assumption to create a gender lookup table for gender-specific first names. Using this lookup table for participants with missing gender values, I was able to recover 7,000 cases (63 percent of the missing responses).
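The lookup-table process can be sketched as follows. The data frame, its column names, and the threshold are hypothetical; the study used a cutoff of more than 20 occurrences, which is lowered here (min_count) so that a toy example with 11 rows can show the logic.

```r
# Hypothetical survey data with three missing gender values
survey <- data.frame(
  first  = c("William", "William", "William", "Chris", "Chris", "Chris",
             "Mary", "Mary", "William", "Chris", "Mary"),
  gender = c("M", "M", "M", "M", "F", "M", "F", "F", NA, NA, NA),
  stringsAsFactors = FALSE
)
min_count <- 1   # toy threshold; the study required more than 20 occurrences

# Cross-tabulate first name with gender among cases where gender is known
known <- survey[!is.na(survey$gender), ]
tab   <- table(known$first, known$gender)

# Keep names that occur often enough and map to exactly one gender
unambiguous <- rowSums(tab) > min_count & rowSums(tab > 0) == 1
sub    <- tab[unambiguous, , drop = FALSE]
lookup <- apply(sub, 1, function(counts) colnames(sub)[counts > 0])
lookup

# Fill in missing genders only where the name appears in the lookup table
to_fill <- is.na(survey$gender) & survey$first %in% names(lookup)
survey$gender_imputed <- survey$gender
survey$gender_imputed[to_fill] <- lookup[survey$first[to_fill]]
survey
```

"Chris" maps to both genders in the known cases, so it's excluded from the lookup table and its missing value stays missing. Keeping the imputed values in a separate column makes it easy to tell recovered values from reported ones later.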

A rational approach typically requires creativity and thoughtfulness, along with a degree of data management skill. Data recovery may be exact (as in the sleep example) or approximate (as in the gender example). In the next section, we’ll explore an approach that creates complete datasets by removing observations.

15.6. Complete-case analysis (listwise deletion)

In complete-case analysis, only observations containing valid data values on every variable are retained for further analysis. Practically, this involves deleting any row containing one or more missing values, and is also known as listwise, or case-wise, deletion. Most popular statistical packages employ listwise deletion as the default approach for handling missing data. In fact, it’s so common that many analysts carrying out analyses like regression or ANOVA may not even realize that there’s a “missing values problem” to be dealt with!

The function complete.cases() can be used to save the cases (rows) of a matrix or data frame without missing data:

newdata <- mydata[complete.cases(mydata),]

The same result can be accomplished with the na.omit() function:

newdata <- na.omit(mydata)

In both statements, any rows containing missing data are deleted from mydata before the results are saved to newdata.
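A small sketch on a made-up data frame confirms that the two approaches keep the same rows. It also shows one difference worth knowing: na.omit() records which rows it dropped in an attribute named na.action.

```r
# Hypothetical data frame: rows 2 and 3 each contain a missing value
mydata <- data.frame(a = c(1, NA, 3),
                     b = c("x", "y", NA),
                     stringsAsFactors = FALSE)

by_index <- mydata[complete.cases(mydata), ]
by_omit  <- na.omit(mydata)

nrow(by_index)                                   # 1
nrow(by_omit)                                    # 1

# Row numbers that na.omit() removed
dropped <- as.integer(attr(by_omit, "na.action"))
dropped                                          # 2 3
```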

Suppose you’re interested in the correlations among the variables in the sleep study. Applying listwise deletion, you’d delete all mammals with missing data prior to calculating the correlations:

> options(digits=1)
> cor(na.omit(sleep))
          BodyWgt BrainWgt NonD Dream Sleep  Span  Gest  Pred  Exp Danger
BodyWgt      1.00     0.96 -0.4 -0.07  -0.3  0.47  0.71  0.10  0.4   0.26
BrainWgt     0.96     1.00 -0.4 -0.07  -0.3  0.63  0.73 -0.02  0.3   0.15
NonD        -0.39    -0.39  1.0  0.52   1.0 -0.37 -0.61 -0.35 -0.6  -0.53
Dream       -0.07    -0.07  0.5  1.00   0.7 -0.27 -0.41 -0.40 -0.5  -0.57
Sleep       -0.34    -0.34  1.0  0.72   1.0 -0.38 -0.61 -0.40 -0.6  -0.60
Span         0.47     0.63 -0.4 -0.27  -0.4  1.00  0.65 -0.17  0.3   0.01
Gest         0.71     0.73 -0.6 -0.41  -0.6  0.65  1.00  0.09  0.6   0.31
Pred         0.10    -0.02 -0.4 -0.40  -0.4 -0.17  0.09  1.00  0.6   0.93
Exp          0.41     0.32 -0.6 -0.50  -0.6  0.32  0.57  0.63  1.0   0.79
Danger       0.26     0.15 -0.5 -0.57  -0.6  0.01  0.31  0.93  0.8   1.00

The correlations in this table are based solely on the 42 mammals that have complete data on all variables. (Note that the statement cor(sleep, use="complete.obs") would have produced the same results.)

If you wanted to study the impact of life span and length of gestation on the amount of dream sleep, you could employ linear regression with listwise deletion:

> fit <- lm(Dream ~ Span + Gest, data=na.omit(sleep))
> summary(fit)
Call:
lm(formula = Dream ~ Span + Gest, data = na.omit(sleep))

Residuals:
   Min     1Q Median     3Q    Max
-2.333 -0.915 -0.221  0.382  4.183

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.480122   0.298476    8.31  3.7e-10 ***
Span        -0.000472   0.013130   -0.04   0.971
Gest        -0.004394   0.002081   -2.11   0.041  *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1 on 39 degrees of freedom
Multiple R-squared: 0.167,      Adjusted R-squared: 0.125
F-statistic: 3.92 on 2 and 39 DF,  p-value: 0.0282

Here you see that mammals with shorter gestation periods have more dream sleep (controlling for life span) and that life span is unrelated to dream sleep when controlling for gestation period. The analysis is based on 42 cases with complete data.

In the previous example, what would have happened if data=na.omit(sleep) had been replaced with data=sleep? Like many R functions, lm() uses a limited definition of listwise deletion. Cases with any missing data on the variables fitted by the function (Dream, Span, and Gest in this case) would have been deleted. The analysis would have been based on 44 cases.
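The difference between the two definitions is easy to see on a small hypothetical data frame in which one variable (other) has missing values but isn't part of the model. The variable names and values below are made up; nobs() reports the number of cases a fitted model used.

```r
set.seed(1)
dat <- data.frame(y = rnorm(10), x1 = rnorm(10),
                  x2 = rnorm(10), other = rnorm(10))
dat$y[1]       <- NA
dat$x1[2]      <- NA
dat$other[3:5] <- NA    # missing, but not used in the model

fit_model_vars <- lm(y ~ x1 + x2, data = dat)           # drops rows 1 and 2 only
fit_listwise   <- lm(y ~ x1 + x2, data = na.omit(dat))  # drops rows 1 through 5

nobs(fit_model_vars)   # 8
nobs(fit_listwise)     # 5

# Same count as restricting complete.cases() to the model variables
sum(complete.cases(dat[c("y", "x1", "x2")]))   # 8
```

If you compare several models fit to the same data frame this way, each may end up based on a different subset of cases.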

Listwise deletion assumes that the data are MCAR (that is, the complete observations are a random subsample of the full dataset). In the current example, we’ve assumed that the 42 mammals used are a random subsample of the 62 mammals collected. To the degree that the MCAR assumption is violated, the resulting regression parameters will be biased. Deleting all observations with missing data can also reduce statistical power by reducing the available sample size. In the current example, listwise deletion reduced the sample size by 32 percent. Next, we’ll consider an approach that employs the entire dataset (including cases with missing data).

15.7. Multiple imputation

Multiple imputation (MI) provides an approach to missing values that’s based on repeated simulations. MI is frequently the method of choice for complex missing values problems. In MI, a set of complete datasets (typically 3 to 10) is generated from an existing dataset containing missing values. Monte Carlo methods are used to fill in the missing data in each of the simulated datasets. Standard statistical methods are applied to each of the simulated datasets, and the outcomes are combined to provide estimated results and confidence intervals that take into account the uncertainty introduced by the missing values. Good implementations are available in R through the Amelia, mice, and mi packages. In this section we’ll focus on the approach provided by the mice (multivariate imputation by chained equations) package.

To understand how the mice package operates, consider the diagram in figure 15.5.

Figure 15.5. Steps in applying multiple imputation to missing data via the mice approach.

The function mice() starts with a data frame containing missing data and returns an object containing several complete datasets (the default is 5). Each complete dataset is created by imputing values for the missing data in the original data frame. There’s a random component to the imputations, so each complete dataset is slightly different. The with() function is then used to apply a statistical model (for example, linear or generalized linear model) to each complete dataset in turn. Finally, the pool() function combines the results of these separate analyses into a single set of results. The standard errors and p-values in this final model correctly reflect the uncertainty produced by both the missing values and the multiple imputations.
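The pooling step can be illustrated in base R. The sketch below applies Rubin's rules, the standard combining rules for multiply imputed analyses, to one regression coefficient. The estimates and standard errors are hypothetical numbers invented for illustration, not results from the sleep data, and the code is not a substitute for pool(), which also reports degrees of freedom and the fraction of missing information.

```r
# Hypothetical coefficient estimates and standard errors from m = 5
# analyses, one per imputed dataset (invented values)
est <- c(-0.0040, -0.0045, -0.0041, -0.0043, -0.0042)
se  <- c(0.0016, 0.0015, 0.0016, 0.0015, 0.0016)
m   <- length(est)

pooled_est  <- mean(est)      # pooled point estimate
within_var  <- mean(se^2)     # average within-imputation variance
between_var <- var(est)       # variance of the estimates across imputations

# Total variance adds the between-imputation component, inflated by (1 + 1/m)
total_var <- within_var + (1 + 1 / m) * between_var
pooled_se <- sqrt(total_var)

print(c(estimate = pooled_est, se = pooled_se))
```

Because the between-imputation variance is added to the average within-imputation variance, the pooled standard error is never smaller than the one obtained by simply averaging the squared standard errors.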

 

How does the mice() function impute missing values?

Missing values are imputed by Gibbs sampling. By default, each variable containing missing values is predicted from all other variables in the dataset. These prediction equations are used to impute plausible values for the missing data. The process iterates until convergence over the missing values is achieved. For each variable, the user can choose the form of the prediction model (called an elementary imputation method), and the variables entered into it.

By default, predictive mean matching is used to replace missing data on continuous variables, while logistic or polytomous logistic regression is used for target variables that are dichotomous (factors with two levels) or polytomous (factors with more than two levels) respectively. Other elementary imputation methods include Bayesian linear regression, discriminant function analysis, two-level normal imputation, and random sampling from observed values. Users can supply their own methods as well.
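The idea behind predictive mean matching can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the mice implementation: it uses a single predictor, hypothetical toy values, an arbitrary choice of k = 3 donors, and it omits the random draw of regression parameters that a full implementation would include.

```r
set.seed(1234)

# Hypothetical toy data: y has missing values, x is fully observed
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(2.1, 3.9, NA, 8.2, 9.8, NA, 14.1, 16.2, NA, 19.9)
observed <- !is.na(y)

# Step 1: fit the prediction model on the cases where y is observed
fit <- lm(y ~ x, subset = observed)

# Step 2: compute predicted means for every case
pred <- predict(fit, newdata = data.frame(x = x))

# Step 3: for each missing case, find the k observed cases with the
# closest predicted means and borrow the observed value of one of them
k <- 3
y_imputed <- y
for (i in which(!observed)) {
  distance <- abs(pred[observed] - pred[i])
  donors   <- y[observed][order(distance)[1:k]]
  y_imputed[i] <- sample(donors, 1)
}

print(y_imputed)
```

Because each imputed value is copied from an observed case, the imputations always fall within the range of the observed data.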

 

An analysis based on the mice package will typically conform to the following structure:

library(mice)
imp <- mice(mydata, m)
fit <- with(imp, analysis)
pooled <- pool(fit)
summary(pooled)

where

  • mydata is a matrix or data frame containing missing values.
  • imp is a list object containing the m imputed datasets, along with information on how the imputations were accomplished. By default, m=5.
  • analysis is a formula object specifying the statistical analysis to be applied to each of the m imputed datasets. Examples include lm() for linear regression models, glm() for generalized linear models, gam() for generalized additive models, and glm.nb() (in the MASS package) for negative binomial models. Formulas within the parentheses give the response variables on the left of the ~ and the predictor variables (separated by + signs) on the right.
  • fit is a list object containing the results of the m separate statistical analyses.
  • pooled is a list object containing the averaged results of these m statistical analyses.

Let’s apply multiple imputation to our sleep dataset. We’ll repeat the analysis from section 15.6, but this time, use all 62 mammals. Set the seed value for the random number generator to 1234 so that your results will match mine.

> library(mice)
> data(sleep, package="VIM")
> imp <- mice(sleep, seed=1234)

 [...output deleted to save space...]

> fit <- with(imp, lm(Dream ~ Span + Gest))
> pooled <- pool(fit)
> summary(pooled)
                 est      se      t   df Pr(>|t|)    lo 95
(Intercept)  2.58858 0.27552  9.395 52.1 8.34e-13  2.03576
Span        -0.00276 0.01295 -0.213 52.9 8.32e-01 -0.02874
Gest        -0.00421 0.00157 -2.671 55.6 9.91e-03 -0.00736
               hi 95 nmis     fmi
(Intercept)  3.14141    NA 0.0870
Span         0.02322     4 0.0806
Gest        -0.00105     4 0.0537

Here, you see that the regression coefficient for Span isn’t significant (p ≅ 0.83), and the coefficient for Gest is significant at the p<0.01 level. If you compare these results with those produced by a complete-case analysis (section 15.6), you see that you’d come to the same conclusions in this instance. Length of gestation has a (statistically) significant, negative relationship with amount of dream sleep, controlling for life span. Although the complete-case analysis was based on the 42 mammals with complete data, the current analysis is based on information gathered from the full set of 62 mammals. By the way, the fmi column reports the fraction of missing information (that is, the proportion of variability that is attributable to the uncertainty introduced by the missing data).

You can access more information about the imputation by examining the objects created in the analysis. For example, let’s view a summary of the imp object:

> imp

Multiply imputed data set
Call:
mice(data = sleep, seed = 1234)
Number of multiple imputations:  5
Missing cells per column:
 BodyWgt BrainWgt     NonD    Dream    Sleep     Span     Gest      Pred
       0        0       14       12        4        4        4         0
     Exp    Danger
       0         0
Imputation methods:
 BodyWgt BrainWgt     NonD    Dream    Sleep     Span     Gest      Pred
      ""       ""    "pmm"    "pmm"    "pmm"    "pmm"    "pmm"        ""
     Exp    Danger
      ""        ""
VisitSequence:
  NonD Dream Sleep  Span  Gest
    3     4     5     6      7
PredictorMatrix:
          BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
BodyWgt         0        0    0     0     0    0    0    0   0      0
BrainWgt        0        0    0     0     0    0    0    0   0      0
NonD            1        1    0     1     1    1    1    1   1      1
Dream           1        1    1     0     1    1    1    1   1      1
Sleep           1        1    1     1     0    1    1    1   1      1
Span            1        1    1     1     1    0    1    1   1      1
Gest            1        1    1     1     1    1    0    1   1      1
Pred            0        0    0     0     0    0    0    0   0      0
Exp             0        0    0     0     0    0    0    0   0      0
Danger          0        0    0     0     0    0    0    0   0      0
Random generator seed value:  1234

From the resulting output, you can see that five synthetic datasets were created, and that the predictive mean matching (pmm) method was used for each variable with missing data. No imputation ("") was needed for BodyWgt, BrainWgt, Pred, Exp, or Danger, because they had no missing values. The VisitSequence tells you that variables were imputed from left to right, starting with NonD (column 3) and ending with Gest (column 7). Finally, the PredictorMatrix indicates that each variable with missing data was imputed using all the other variables in the dataset. (In this matrix, the rows represent the variables being imputed, the columns represent the variables used for the imputation, and 1’s/0’s indicate used/not used.)
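The row-and-column convention of a predictor matrix can be made concrete with a small base R sketch. The matrix below is a hypothetical four-variable example built by hand (it is not the matrix produced by mice() above); the code lists, for each row variable, the column variables flagged with a 1.

```r
# Hypothetical predictor matrix: rows are variables being imputed,
# columns are candidate predictors, 1 = used, 0 = not used
vars <- c("BodyWgt", "NonD", "Dream", "Span")
pred_matrix <- matrix(
  c(0, 0, 0, 0,    # BodyWgt has no missing values, so nothing is used
    1, 0, 1, 1,    # NonD is imputed from BodyWgt, Dream, and Span
    1, 1, 0, 1,    # Dream is imputed from BodyWgt, NonD, and Span
    1, 1, 1, 0),   # Span is imputed from BodyWgt, NonD, and Dream
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# For each row, collect the names of the columns set to 1
predictors_used <- lapply(vars, function(v) vars[pred_matrix[v, ] == 1])
names(predictors_used) <- vars

print(predictors_used$Dream)
```

A row of all zeros, as for BodyWgt here, corresponds to a variable that is not imputed.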

You can view the actual imputations by looking at subcomponents of the imp object. For example,

> imp$imp$Dream
     1   2   3   4   5
1  0.5 0.5 0.5 0.5 0.0
3  2.3 2.4 1.9 1.5 2.4
4  1.2 1.3 5.6 2.3 1.3
14 0.6 1.0 0.0 0.3 0.5
24 1.2 1.0 5.6 1.0 6.6
26 1.9 6.6 0.9 2.2 2.0
30 1.0 1.2 2.6 2.3 1.4
31 5.6 0.5 1.2 0.5 1.4
47 0.7 0.6 1.4 1.8 3.6
53 0.7 0.5 0.7 0.5 0.5
55 0.5 2.4 0.7 2.6 2.6
62 1.9 1.4 3.6 5.6 6.6

displays the five imputed values for each of the 12 mammals with missing data on the Dream variable. A review of these matrices helps you determine if the imputed values are reasonable. A negative value for length of sleep might give you pause (or nightmares).

You can view each of the m imputed datasets via the complete() function. The format is

complete(imp, action=#)

where # specifies one of the m synthetically complete datasets. For example,

> dataset3 <- complete(imp, action=3)
> dataset3
  BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger
1 6654.00   5712.0  2.1   0.5   3.3 38.6  645    3   5      3
2    1.00      6.6  6.3   2.0   8.3  4.5   42    3   1      3
3    3.38     44.5 10.6   1.9  12.5 14.0   60    1   1      1
4    0.92      5.7 11.0   5.6  16.5  4.7   25    5   2      3
5 2547.00   4603.0  2.1   1.8   3.9 69.0  624    3   5      4
6   10.55    179.5  9.1   0.7   9.8 27.0  180    4   4      4
[...output deleted to save space...]

displays the third (out of five) complete datasets created by the multiple imputation process.

Due to space limitations, we’ve only briefly considered the MI implementation provided in the mice package. The mi and Amelia packages also contain valuable approaches. If you are interested in the multiple imputation approach to missing data, I recommend the following resources:

Each can help to reinforce and extend your understanding of this important, but underutilized, methodology.

15.8. Other approaches to missing data

R supports several other approaches for dealing with missing data. Although not as broadly applicable as the methods described thus far, the packages described in table 15.2 offer functions that can be quite useful in specialized circumstances.

Table 15.2. Specialized methods for dealing with missing data

Package                                 Description
Hmisc                                   Contains numerous functions supporting simple imputation, multiple imputation, and imputation using canonical variates
mvnmle                                  Maximum likelihood estimation for multivariate normal data with missing values
cat                                     Multiple imputation of multivariate categorical data under log-linear models
arrayImpute, arrayMissPattern, SeqKnn   Useful functions for dealing with missing microarray data
longitudinalData                        Contains utility functions, including interpolation routines for imputing missing time series values
kmi                                     Kaplan-Meier multiple imputation for survival analysis with missing data
mix                                     Multiple imputation for mixed categorical and continuous data under the general location model
pan                                     Multiple imputation for multivariate panel or clustered data

Finally, there are two methods for dealing with missing data that are still in use, but should now be considered obsolete. They are pairwise deletion and simple imputation.

15.8.1. Pairwise deletion

Pairwise deletion is often considered an alternative to listwise deletion when working with datasets containing missing values. In pairwise deletion, observations are only deleted if they’re missing data for the variables involved in a specific analysis. Consider the following code:

> cor(sleep, use="pairwise.complete.obs")
          BodyWgt BrainWgt NonD Dream Sleep  Span Gest  Pred  Exp Danger
BodyWgt      1.00     0.93 -0.4  -0.1  -0.3  0.30  0.7  0.06  0.3   0.13
BrainWgt     0.93     1.00 -0.4  -0.1  -0.4  0.51  0.7  0.03  0.4   0.15
NonD        -0.38    -0.37  1.0   0.5   1.0 -0.38 -0.6 -0.32 -0.5  -0.48
Dream       -0.11    -0.11  0.5   1.0   0.7 -0.30 -0.5 -0.45 -0.5  -0.58
Sleep       -0.31    -0.36  1.0   0.7   1.0 -0.41 -0.6 -0.40 -0.6  -0.59
Span         0.30     0.51 -0.4  -0.3  -0.4  1.00  0.6 -0.10  0.4   0.06
Gest         0.65     0.75 -0.6  -0.5  -0.6  0.61  1.0  0.20  0.6   0.38
Pred         0.06     0.03 -0.3  -0.4  -0.4 -0.10  0.2  1.00  0.6   0.92
Exp          0.34     0.37 -0.5  -0.5  -0.6  0.36  0.6  0.62  1.0   0.79
Danger       0.13     0.15 -0.5  -0.6  -0.6  0.06  0.4  0.92  0.8   1.00

In this example, correlations between any two variables use all available observations for those two variables (ignoring the other variables). The correlation between BodyWgt and BrainWgt is based on all 62 mammals (the number of mammals with data on both variables). The correlation between BodyWgt and NonD is based on 48 mammals (NonD is missing for 14 of the 62), and the correlation between Dream and NonD is based on 46 mammals.
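The number of cases behind each pairwise correlation can be counted directly. The sketch below is one way to do so in base R, shown on a hypothetical toy data frame (the variables a, b, and c and their values are invented for illustration); the same two lines could be applied to any numeric data frame.

```r
# Hypothetical toy data with missing values in b and c
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, NA, 4, 5, NA, 7),
  c = c(NA, 1, 3, NA, 2, 8)
)

# 1 where a value is present, 0 where it is missing
present <- ifelse(is.na(as.matrix(toy)), 0, 1)

# Entry [i, j] counts the rows where both variable i and variable j are
# present; the diagonal gives the number of observed values per variable
n_pairs <- crossprod(present)
print(n_pairs)
```

Printing such a matrix alongside the correlation matrix makes it obvious that different cells rest on different subsets of cases.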

Although pairwise deletion appears to use all available data, in fact each calculation is based on a different subset of the data. This can lead to distorted and difficult-to-interpret results. I recommend staying away from this approach.

15.8.2. Simple (nonstochastic) imputation

In simple imputation, the missing values in a variable are replaced with a single value (for example, mean, median, or mode). Using mean substitution you could replace missing values on Dream with the value 1.97 and missing values on NonD with the value 8.67 (the means on Dream and NonD, respectively). Note that the substitution is nonstochastic, meaning that random error isn’t introduced (unlike multiple imputation).

An advantage to simple imputation is that it solves the “missing values problem” without reducing the sample size available for the analyses. Simple imputation is, well, simple, but it produces biased results for data that aren’t MCAR. If there are moderate to large amounts of missing data, simple imputation is likely to underestimate standard errors, distort correlations among variables, and produce incorrect p-values in statistical tests. As with pairwise deletion, I recommend avoiding this approach for most missing data problems.
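The shrinkage in variability caused by mean substitution is easy to demonstrate. The following sketch uses a hypothetical vector (invented values, not the sleep data) and compares the standard deviation before and after replacing missing values with the observed mean.

```r
# Hypothetical variable with three missing values
x <- c(2.0, NA, 3.5, 1.0, NA, 4.5, 2.5, NA, 3.0, 1.5)

# Mean substitution: every NA is replaced by the mean of the observed values
x_mean_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# The mean is unchanged, but the spread is artificially reduced because
# the imputed cases sit exactly at the center of the distribution
sd_observed <- sd(x, na.rm = TRUE)
sd_imputed  <- sd(x_mean_imputed)
print(c(observed = sd_observed, imputed = sd_imputed))
```

The smaller standard deviation feeds directly into standard errors that are too small, which is one reason the resulting p-values can't be trusted.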

15.9. Summary

Most statistical methods assume that the input data is complete and doesn’t include missing values (for example, NA, NaN, Inf). But most datasets in real-world settings contain missing values. Therefore, you must either delete the missing values or replace them with reasonable substitute values before continuing with the desired analyses. Often, statistical packages will provide default methods for handling missing data, but these approaches may not be optimal. Therefore, it’s important that you understand the various approaches available, and the ramifications of using each.

In this chapter, we examined methods for identifying missing values and exploring patterns of missing data. Our goal was to understand the mechanisms that led to the missing data and their possible impact on subsequent analyses. We then reviewed three popular methods for dealing with missing data: a rational approach, listwise deletion, and the use of multiple imputation.

Rational approaches can be used to recover missing values when there are redundancies in the data, or external information that can be brought to bear on the problem. The listwise deletion of missing data is useful if the data are MCAR and the subsequent sample size reduction doesn’t seriously impact the power of statistical tests. Multiple imputation is rapidly becoming the method of choice for complex missing data problems when you can assume that the data are MCAR or MAR. Although many analysts may be unfamiliar with multiple imputation strategies, user-contributed packages (mice, mi, Amelia) make them readily accessible. I believe that we’ll see a rapid growth in their use over the next few years.

We ended the chapter by briefly mentioning R packages that provide specialized approaches for dealing with missing data, and singled out general approaches for handling missing data (pairwise deletion, simple imputation) that should be avoided.

In the next chapter, we’ll explore advanced graphical methods, including the use of lattice graphs, the ggplot2 system, and interactive graphical methods.
