Finding and removing missing values from your dataset is not always a viable option, for either operational or methodological reasons. It is often preferable to simulate plausible values for the missing data and integrate those values with the observed data.
This recipe is based on the mice package by Stef van Buuren. It provides an efficient algorithm for missing value substitution based on the multiple imputation technique.
Multiple imputation technique
The multiple imputation technique is a statistical solution to the problem of missing values.
The main idea behind this technique is to draw several plausible alternative values for each missing value and then, after a proper analysis of the simulated values, populate the original dataset with synthetic data.
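To make the idea of drawing several candidate values concrete, here is a minimal sketch. It assumes the nhanes example dataset bundled with mice rather than the data used in this recipe, and the seed value is arbitrary:

```r
library(mice)

# nhanes is a small example dataset shipped with mice; it is used here
# only for illustration and is not the dataset of this recipe.
# Draw m = 3 candidate values for every missing cell.
# printFlag = FALSE silences the iteration log; the seed is arbitrary.
imp <- mice(nhanes, m = 3, seed = 42, printFlag = FALSE)

# One row per missing bmi value, one column per candidate draw.
candidates <- imp$imp$bmi
print(candidates)
```

Each row of candidates corresponds to one missing bmi value, and each column holds one of the alternative values drawn for it.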
This recipe requires that you install and load the mice package:
install.packages("mice")
library(mice)
For illustrative purposes, we will use the tidy_gdp data frame created in the Preparing your data for analysis with the tidyr package recipe. This dataset is provided with the RStudio project for this book. You can download it by authenticating your account at http://packtpub.com.
To make the missing values appear, we have to convert the gdp column from character to numeric. Any entry that cannot be parsed as a number becomes NA:
tidy_gdp$gdp <- as.numeric(tidy_gdp$gdp)
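The following sketch shows why this conversion surfaces missing values. The character vector is hypothetical: ".." and the empty string merely stand in for whatever placeholders the raw data may contain.

```r
# Hypothetical character column; ".." and "" stand in for unavailable figures.
gdp_chr <- c("1234.5", "..", "987.6", "")

# as.numeric() returns NA for anything it cannot parse. The coercion
# warning is suppressed here only to keep the output clean.
gdp_num <- suppressWarnings(as.numeric(gdp_chr))

# Flag and count the entries that became NA.
is.na(gdp_num)
sum(is.na(gdp_num))
```

In real code, it is usually worth reading the coercion warning rather than suppressing it, since it tells you that some entries were not numeric.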
As shown in the previous recipe, you can now look for missing value patterns by leveraging the md.pattern() function on the whole data frame:
md.pattern(tidy_gdp)
      year  gdp Country Name Country Code Indicator Name Indicator Code
10379    1    1            0            0              0              0     4
 3757    1    0            0            0              0              0     5
         0 3757        14136        14136          14136          14136 60301
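As a quick cross-check of the md.pattern() summary, the same counts can be obtained with base R. This is a minimal sketch on a hypothetical toy data frame, not on tidy_gdp itself:

```r
# Hypothetical toy data frame with the same kind of gap as tidy_gdp.
toy <- data.frame(
  year = c(2000, 2001, 2002, 2003),
  gdp  = c(1.2, NA, 1.5, NA)
)

# Missing values per column (comparable to the bottom row of md.pattern()).
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Number of rows with no missing value at all.
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```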
Generate data to substitute missing values using the mice() function:
simulation <- mice(tidy_gdp, method = "pmm")
The console will then print the following output:
 iter imp variable
  1   1  gdp
  1   2  gdp
  1   3  gdp
  1   4  gdp
  1   5  gdp
  2   1  gdp
  2   2  gdp
  2   3  gdp
  2   4  gdp
  2   5  gdp
  3   1  gdp
  3   2  gdp
  3   3  gdp
  3   4  gdp
  3   5  gdp
  4   1  gdp
  4   2  gdp
  4   3  gdp
  4   4  gdp
  4   5  gdp
  5   1  gdp
  5   2  gdp
  5   3  gdp
  5   4  gdp
  5   5  gdp
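Because the imputed values are random draws, two runs will generally differ. The sketch below shows one way to make a run repeatable with the seed argument of mice(); it assumes the bundled nhanes dataset, and the seed value 123 is arbitrary:

```r
library(mice)

# Two runs with the same seed; printFlag = FALSE hides the iteration log.
run_a <- mice(nhanes, method = "pmm", seed = 123, printFlag = FALSE)
run_b <- mice(nhanes, method = "pmm", seed = 123, printFlag = FALSE)

# With identical seeds, the imputed values should match.
same_draws <- isTRUE(all.equal(run_a$imp, run_b$imp))
print(same_draws)
```

Fixing the seed is useful when you want someone else to reproduce exactly the same completed dataset.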
Populate the original data with the generated data using the complete() function:
tidy_gdp_complete <- complete(simulation, action = 1)
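The action argument can be sketched as follows, again assuming the bundled nhanes dataset and an arbitrary seed. The call is written as mice::complete() because tidyr also exports a function named complete(), which can mask the mice version if tidyr is attached afterwards:

```r
library(mice)

imp <- mice(nhanes, seed = 1, printFlag = FALSE)

# Completed dataset built from the first and from the third set of draws.
first_set <- mice::complete(imp, action = 1)
third_set <- mice::complete(imp, action = 3)

# All completed datasets stacked on top of each other.
stacked <- mice::complete(imp, action = "long")

anyNA(first_set)   # no missing values should remain
nrow(stacked)      # m times the number of original rows
```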
Check the reasonableness of the generated values with the densityplot() function:
densityplot(simulation)
This function will show a plot representing simulated data in red and recorded data in blue. If the two density curves differ too much, the resulting dataset should be considered unreliable.
If the check shows that the simulated data is not reasonable, you should consider changing the statistical method used to generate the data. You may also abandon the idea of using synthetic data, particularly if the ratio of missing to recorded values is very high.
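A numeric comparison can complement the visual check. The sketch below is one possible approach, not part of the original recipe; it assumes the bundled nhanes dataset and the default method for numeric columns (pmm):

```r
library(mice)

imp <- mice(nhanes, seed = 1, printFlag = FALSE)

# Recorded values versus all values drawn for the missing cells.
observed <- nhanes$bmi[!is.na(nhanes$bmi)]
imputed  <- unlist(imp$imp$bmi)

# Compare the two distributions side by side.
print(summary(observed))
print(summary(imputed))

# pmm borrows values from observed donors, so every imputed value
# should also appear among the observed ones.
borrowed <- all(imputed %in% observed)
print(borrowed)
```

Large differences between the two summaries would point to the same problem that diverging density curves reveal.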
In step 1, we generate data to substitute missing values. This step leverages the mice() function to produce multiple possible values to be used as substitutes for the missing values within the tidy_gdp data frame.
The number of possible values to be produced for each missing value is determined by the m parameter, set by default to 5. This is usually a sufficient number of imputations. However, especially with unsuccessful results, you could consider increasing it.
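The m parameter is easy to confuse with maxit, which controls how many iterations the algorithm runs for each imputation. The following sketch sets both explicitly, assuming the bundled nhanes dataset and an arbitrary seed:

```r
library(mice)

# m     = number of candidate values drawn per missing cell
# maxit = number of iterations run for each of those draws
imp10 <- mice(nhanes, m = 10, maxit = 5, seed = 7, printFlag = FALSE)

imp10$m               # number of imputations requested
ncol(imp10$imp$chl)   # one column per imputation
imp10$iteration       # last iteration that was run
```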
In step 2, we populate the original data with generated data. This step is easily performed using the complete() function, which lets you choose which of the m values generated for each missing value is substituted for it. This choice is made by specifying the action parameter.
In step 3, we check for the reasonableness of the generated values. Using the densityplot() function, a generic from the lattice package (which should be preinstalled with every distribution of R) that mice extends to handle its own results, we can easily assess whether simulated values are reasonable compared to observed values.
In step 4, we iterate until we get satisfactory results. In the case of negative feedback from the density plot, you should consider changing the statistical method used to generate the missing values, as shown in the There's more... section.
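Switching methods can be sketched as follows. The example assumes the bundled nhanes dataset and uses "norm" (Bayesian linear regression), one of the built-in alternatives for numeric data, purely as an illustration; which method suits your data is a judgment call:

```r
library(mice)

# Same data and seed, two different imputation methods.
imp_pmm  <- mice(nhanes, method = "pmm",  seed = 11, printFlag = FALSE)
imp_norm <- mice(nhanes, method = "norm", seed = 11, printFlag = FALSE)

# Compare the average imputed bmi under each method.
mean_imputed <- sapply(
  list(pmm = imp_pmm, norm = imp_norm),
  function(imp) mean(unlist(imp$imp$bmi))
)
print(mean_imputed)
```

After each change, rerun the density plot check on the new result before deciding which completed dataset to keep.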
Different simulation methods are available within the mice() function. These methods can be selected using the method argument:
pmm (predictive mean matching)
logreg (logistic regression imputation)
polyreg (polytomous regression imputation)
polr (proportional odds model)
If no value is provided, the mice() function will automatically select a method depending on the data type of each column: pmm for numeric data, logreg for factors with two levels, polyreg for unordered factors with more than two levels, and polr for ordered factors with more than two levels.
You can find more methods in the mice() function documentation by running the following command:
?mice()
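To see which method mice() would pick for each column without actually imputing anything, a common idiom is a dry run with maxit = 0. The sketch below assumes the nhanes2 example dataset bundled with mice, which mixes numeric and factor columns:

```r
library(mice)

# nhanes2 contains numeric columns (bmi, chl) and factors (age, hyp).
str(nhanes2)

# maxit = 0 sets everything up but runs no iterations.
dry_run <- mice(nhanes2, maxit = 0)

# Method chosen for each column; an empty string means the column
# is not imputed.
print(dry_run$method)
```

The returned vector can be edited and passed back through the method argument to override the choice for individual columns.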
If you are dealing with missing values, you will find another great ally in the VIM package. This package also provides tools for the visualization of missing and/or imputed values.
You can find out more on this package in the official reference manual at https://cran.r-project.org/web/packages/VIM/VIM.pdf.