Detecting and removing missing values

Missing values are values that should have been recorded but, for some reason, were not. R represents them with NA (not available). They are different from values without meaning, which R represents with NaN (not a number).
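The difference between the two can be checked directly in the console. The following is a minimal sketch using only base R; the vector name values is made up for illustration:

```r
# NA marks a value that was not recorded;
# NaN is the result of an undefined computation such as 0 / 0.
values <- c(1, NA, 0 / 0)

is.na(values)   # TRUE for both the NA and the NaN element
is.nan(values)  # TRUE only for the NaN element
```

Note that is.na() also flags NaN, so a check based on is.na() alone does not tell the two apart.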

Most of us first ran into missing values in a situation like the following one:

> x <- c(1,2,3,NA,4)
> mean(x)
[1] NA

"Oh come on, I know you can do it. Just ignore that useless NA" was probably your reaction, or at least it was mine.

Fortunately, R comes packed with good functions for missing value detection and handling.
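As a quick illustration, here is a small sketch of some base R detection functions applied to the vector from the preceding example. It is not the full recipe, only a first look at what is available without extra packages:

```r
x <- c(1, 2, 3, NA, 4)

anyNA(x)          # is there at least one missing value?
which(is.na(x))   # positions of the missing values
sum(is.na(x))     # how many values are missing
```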

In this recipe and the following one, we will see two opposite approaches to missing value handling:

  • Removing missing values
  • Simulating missing values by interpolation

I have to warn you that removing missing values is appropriate only in a small number of cases, since it can compromise the integrity of your data and greatly reduce the reliability of your results.

Nevertheless, if you are determined to do this, I will show you how to do it effectively, using the md.pattern() function from the mice package by Stef van Buuren together with the complete.cases() function, which ships with base R (in the stats package).

Getting ready

Before applying this recipe, you will need to install and load the mice package:

install.packages("mice")
library(mice)

How to do it...

  1. Find where the missing values are located:
    md.pattern(tidy_gdp)
    

    This will result in an output similar to the following screenshot:

    [Screenshot: md.pattern() output for tidy_gdp]

    It shows us that in 10379 cases, Country Name, Country Code, Indicator Name, and Indicator Code are missing, and in 3757 cases, only the year is present and the rest is missing.

  2. Remove rows where data for a given column are missing:
    tidy_gdp_naomit <- subset(tidy_gdp,
    complete.cases(tidy_gdp$gdp))
    

    The tidy_gdp_naomit data frame will now contain only observations where GDP was actually recorded.

  3. Check the result:
    md.pattern(tidy_gdp_naomit)
    

    It should now result in a matrix where no missing value cases are shown for the gdp column:

    [Screenshot: md.pattern() output for tidy_gdp_naomit]

How it works...

In step 1, we find where missing values are located. The md.pattern() function from the mice package is a really useful function. It gives you a clear view of where missing values are located, helping you in decisions regarding exclusions or substitution. You can refer to the next recipe for missing value substitution.
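If you want a rough idea of the same information without mice, the sketch below counts missing values per column and tabulates the observed/missing patterns using base R only. The toy_gdp data frame is a made-up stand-in for tidy_gdp, and this is an approximation of what md.pattern() reports, not a reimplementation of it:

```r
# Hypothetical toy data frame standing in for tidy_gdp
toy_gdp <- data.frame(
  country = c("A", "B", "C", "D"),
  year    = c(2010, 2010, NA, 2011),
  gdp     = c(1.5, NA, NA, 2.0)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy_gdp))

# One string per row: "1" where a value is observed, "0" where it is missing
pattern <- apply(!is.na(toy_gdp), 1,
                 function(row) paste(as.integer(row), collapse = ""))

# How many rows share each pattern
pattern_counts <- table(pattern)
```

Here the pattern "111" stands for a fully observed row, while "100" stands for a row where only the first column was recorded.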

In step 2, we remove rows where data for a given column is missing. The complete.cases() function returns a logical vector that is TRUE for every element (or, for a data frame, every row) that contains no NA. By using that vector as the filter condition in subset(), we keep only the rows where gdp was actually recorded.
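To see what complete.cases() returns, here is a minimal sketch on a made-up data frame (toy_gdp and its columns are hypothetical, chosen only to mirror the shape of tidy_gdp):

```r
# Hypothetical toy data frame standing in for tidy_gdp
toy_gdp <- data.frame(
  country = c("A", "B", "C", "D"),
  year    = c(2010, 2010, NA, 2011),
  gdp     = c(1.5, NA, 3.2, 2.0)
)

# On a single column: FALSE only where gdp is NA
gdp_recorded <- complete.cases(toy_gdp$gdp)

# On the whole data frame: FALSE where any column is NA
row_complete <- complete.cases(toy_gdp)

# Keep the rows where gdp was recorded, whatever the other columns contain
toy_gdp_naomit <- subset(toy_gdp, complete.cases(gdp))
```

Filtering on the single column keeps the row with the missing year, whereas filtering on row_complete would drop it as well. Which of the two you want depends on the analysis you are going to run.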

In step 3, we check the result using the md.pattern() function once again so that we can easily check for the persistence of NA values after our detection and removal procedures.
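The same check can also be made programmatically, which is handy inside a script. The following sketch assumes a cleaned data frame named cleaned (a hypothetical name) and stops with an error if any NA is left in the gdp column:

```r
# Hypothetical result of a removal step: gdp is complete, year is not
cleaned <- data.frame(
  year = c(2010, NA, 2011),
  gdp  = c(1.5, 3.2, 2.0)
)

# Fail loudly if the column we cleaned still contains missing values
stopifnot(!anyNA(cleaned$gdp))

# Other columns may still contain NA, so it is worth listing what remains
remaining_na <- colSums(is.na(cleaned))
```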

There's more...

Another way to deal with missing values is simply ignoring them in our computations. Many R functions, such as mean(), sum(), and max(), offer this as a built-in option, usually through the na.rm argument. A useful piece of advice to keep in mind is that ignoring NA values is not always a solution without consequences.
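As a small sketch of how the argument behaves across a few base R functions (the readings vector is made up for illustration):

```r
readings <- c(2, NA, 5)

sum(readings)                    # NA: the missing value propagates
sum(readings, na.rm = TRUE)      # 7
max(readings, na.rm = TRUE)      # 5
median(readings, na.rm = TRUE)   # 3.5
```

In each case, na.rm = TRUE drops the NA before the computation, so the result is based on fewer observations than the vector contains.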

Take, for instance, the average computation on the following records vector:

records <- c(NA,4,3,6,NA)

Let's ignore NA values by setting the na.rm argument to TRUE:

mean_na_ignoring <- mean(records, na.rm = TRUE)

Printing mean_na_ignoring, we will obtain:

[1] 4.333333

This is quite different from the number we obtain when we treat missing values as records with a value of 0:

> records[is.na(records)] <- 0
> mean(records)
[1] 2.6

This simple example shows you why missing values need to be handled carefully.
