Substituting missing values using the mice package

Finding and removing missing values from your dataset is not always viable, for either operational or methodological reasons. It is often preferable to simulate plausible values for the missing data and integrate them with the observed data.

This recipe is based on the mice package by Stef van Buuren. It provides an efficient algorithm for missing value substitution based on the multiple imputation technique.

Note

Multiple imputation technique

The multiple imputation technique is a statistical solution to the problem of missing values.

The main idea behind this technique is to draw several plausible alternative values for each missing value and then, after a proper analysis of the simulated values, populate the original dataset with the synthetic data.

Getting ready

This recipe requires that you install and load the mice package:

install.packages("mice")
library(mice)

For illustrative purposes, we will use the tidy_gdp data frame created in the Preparing your data for analysis with the tidyr package recipe. This dataset is provided with the RStudio project for this book. You can download it by authenticating your account at http://packtpub.com.

To make missing values appear, we have to convert the gdp column from character to numeric:

tidy_gdp$gdp <- as.numeric(tidy_gdp$gdp)
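
To see why this conversion exposes missing values, here is a minimal sketch on a small made-up vector rather than the real tidy_gdp data. The gdp_chr vector and the ".." placeholder are assumptions about how empty cells might be encoded in a character column:

```r
# Hypothetical character column: ".." and "" stand in for empty cells
gdp_chr <- c("1250.5", "..", "980", "")

# Entries that cannot be parsed as numbers become NA;
# suppressWarnings() hides the "NAs introduced by coercion" warning
gdp_num <- suppressWarnings(as.numeric(gdp_chr))

is.na(gdp_num)       # which entries are now missing
sum(is.na(gdp_num))  # how many entries are missing
```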

As shown in the previous recipe, you can now look for missing value patterns by passing the data frame to the md.pattern() function:

md.pattern(tidy_gdp)
      year  gdp Country Name Country Code Indicator Name Indicator Code
10379    1    1            0            0              0              0     4
 3757    1    0            0            0              0              0     5
         0 3757        14136        14136          14136          14136 60301
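
As a quick cross-check on the pattern table, base R can count missing values per column. The sketch below uses a made-up data frame (toy_gdp and its values are illustrative, not the book's data):

```r
# Toy data frame with two missing gdp values
toy_gdp <- data.frame(
  year = c(1990, 1991, 1992, 1993),
  gdp  = c(10.2, NA, 11.7, NA)
)

colSums(is.na(toy_gdp))    # number of missing values per column
mean(is.na(toy_gdp$gdp))   # share of gdp values that are missing
```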

How to do it...

  1. Generate the data to substitute missing values using the mice() function:
    simulation <- mice(tidy_gdp, method = "pmm")

    The console will then print the following output:

    iter imp variable
      1   1  gdp
      1   2  gdp
      1   3  gdp
      1   4  gdp
      1   5  gdp
      2   1  gdp
      2   2  gdp
      2   3  gdp
      2   4  gdp
      2   5  gdp
      3   1  gdp
      3   2  gdp
      3   3  gdp
      3   4  gdp
      3   5  gdp
      4   1  gdp
      4   2  gdp
      4   3  gdp
      4   4  gdp
      4   5  gdp
      5   1  gdp
      5   2  gdp
      5   3  gdp
      5   4  gdp
      5   5  gdp
    
  2. Populate the original data with generated data:
    tidy_gdp_complete <- complete(simulation, action = 1)
    
  3. Check the reasonableness of the generated values:
      densityplot(simulation)
    

    This function plots the simulated data in red and the recorded data in blue. If the two density curves differ too much, the resulting dataset should be considered unreliable.

  4. Iterate the procedure until you get satisfactory results.

If the check shows that the simulated data is not reasonable, consider changing the statistical method used to generate it. You may also abandon the idea of using synthetic data altogether, particularly if the ratio of missing to recorded values is very high.
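
One way to iterate is to rerun mice() with a different method and compare the imputed values against the observed ones. The following is a sketch under assumptions: it uses nhanes, a small numeric dataset bundled with mice, as a stand-in for tidy_gdp, and "norm" (Bayesian linear regression) as an arbitrary alternative to "pmm". The seed value is also arbitrary:

```r
library(mice)

# nhanes: small numeric example dataset bundled with mice (stand-in for tidy_gdp)
imp_pmm  <- mice(nhanes, method = "pmm",  seed = 123, printFlag = FALSE)
imp_norm <- mice(nhanes, method = "norm", seed = 123, printFlag = FALSE)

# Compare observed bmi values with the values imputed by each method
summary(nhanes$bmi)                 # observed values (NAs reported separately)
summary(unlist(imp_pmm$imp$bmi))    # imputed with predictive mean matching
summary(unlist(imp_norm$imp$bmi))   # imputed with Bayesian linear regression
```

Setting seed makes each run reproducible, which helps when comparing methods side by side.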

How it works...

In step 1, we use the mice() function to produce multiple possible values to be used as substitutes for each missing value within the tidy_gdp data frame.

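The number of possible values produced for each missing value is determined by the m argument, which defaults to 5. This is usually a sufficient number of imputations. However, especially if the results are unsatisfactory, you could consider increasing it. The number of iterations of the algorithm, shown in the iter column of the console output, is controlled separately by the maxit argument, which also defaults to 5.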
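
A minimal sketch of raising the number of imputations, again assuming the bundled nhanes dataset as a stand-in for tidy_gdp. The values 10 and 123 are arbitrary choices for illustration:

```r
library(mice)

# Request 10 imputations and 10 iterations instead of the defaults
imp10 <- mice(nhanes, m = 10, maxit = 10, seed = 123, printFlag = FALSE)

imp10$m             # number of imputations stored in the result
imp10$iteration     # last iteration number reached
dim(imp10$imp$bmi)  # rows = missing bmi cells, columns = imputations
```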

In step 2, we populate the original data with generated data. This step is easily performed using the complete() function, which returns the dataset with each missing value replaced by one of the m values generated for it. The action argument specifies which of the m imputations to use.
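
The effect of the action argument can be sketched as follows, assuming the bundled nhanes dataset as a stand-in for tidy_gdp. The call is written as mice::complete() because tidyr also exports a function named complete(), and whichever package is loaded last would otherwise win:

```r
library(mice)

imp <- mice(nhanes, seed = 123, printFlag = FALSE)

first <- mice::complete(imp, action = 1)       # dataset filled with imputation 1
third <- mice::complete(imp, action = 3)       # dataset filled with imputation 3
long  <- mice::complete(imp, action = "long")  # all imputations stacked by row

anyNA(first)   # should be FALSE: no missing values remain
head(long)     # includes .imp and .id columns identifying each imputation
```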

In step 3, we check the reasonableness of the generated values. Using the densityplot() method that mice provides for its results, built on the lattice package (which ships with standard R distributions), we can easily assess whether the simulated values are reasonable compared to the observed values.
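
The plot can also be restricted to a single variable by passing a formula. This is a sketch assuming the bundled nhanes dataset, where bmi plays the role of gdp:

```r
library(mice)

imp <- mice(nhanes, seed = 123, printFlag = FALSE)

# Observed and imputed densities for bmi only, combined in one panel
densityplot(imp, ~ bmi)

# One panel per imputation, to spot individual runs that look off
densityplot(imp, ~ bmi | .imp)
```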

In step 4, we iterate until we get satisfactory results. In the case of negative feedback from the density plot, you should consider changing the statistical method used to generate the missing values, as shown in the There's more... section.

There's more...

Different simulation methods are available within the mice() function. These methods can be selected using the method argument:

  • pmm (predictive mean matching)
  • logreg (logistic regression imputation)
  • polyreg (polytomous regression imputation)
  • polr (proportional odds model)

If no value is provided, the mice() function will automatically select a method, depending on the data type, by following these rules:

  • Numeric data → predictive mean matching
  • Binary data/factor with two levels → logistic regression imputation
  • Unordered categorical data with more than two levels → polytomous regression imputation
  • Ordered categorical data with more than two levels → proportional odds model
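
To see which method was chosen for each column, you can inspect the method element of the object returned by mice(). The sketch below assumes the nhanes2 dataset bundled with mice, which mixes numeric columns (bmi, chl) with factors (age, hyp):

```r
library(mice)

# No method argument: mice() picks one per column based on its type
imp_default <- mice(nhanes2, seed = 123, printFlag = FALSE)

# Named character vector with the method used for each column;
# columns without missing values are not imputed
imp_default$method
```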

You can find more methods in the mice() function documentation by running the following command:

?mice

If you are dealing with missing values, you will find another great ally in the VIM package. This package also provides tools for the visualization of missing and/or imputed values.

You can find out more on this package in the official reference manual at https://cran.r-project.org/web/packages/VIM/VIM.pdf.
