Chapter 13. Cleaning and Transforming

No matter what format you are given data in, it’s almost always the wrong one for what you want to do with it, and no matter who gave it to you, it’s almost always dirty. Cleaning and transforming data may not be the fun part of data analysis, but you’ll probably spend more of your life than you care to doing it. Fortunately, R has a wide selection of tools to help with these tasks.

Chapter Goals

After reading this chapter, you should:

  • Know how to manipulate strings and clean categorical variables
  • Be able to subset and transform data frames
  • Be able to change the shape of a data frame from wide to long and back again
  • Understand sorting and ordering

Cleaning Strings

Back in Chapter 7, we looked at some simple string manipulation tasks like combining strings together using paste, and extracting sections of a string using substring.

One really common problem is when logical values have been encoded in a way that R doesn’t understand. In the alpe_d_huez cycling dataset, the DrugUse column (denoting whether or not allegations of drug use have been made about each rider’s performance), values have been encoded as "Y" and "N" rather than TRUE or FALSE. For this sort of simple matching, we can directly replace each string with the correct logical value:

yn_to_logical <- function(x)
{
  y <- rep.int(NA, length(x))
  y[x == "Y"] <- TRUE
  y[x == "N"] <- FALSE
  y
}

Setting values to NA by default lets us deal with strings that don’t match "Y" or "N". We call the function in the obvious way:

alpe_d_huez$DrugUse <- yn_to_logical(alpe_d_huez$DrugUse)

This direct replacement of one string with another doesn’t scale very well to having lots of choices of string. If you have ten thousand possible inputs, then a function to change each one would be very hard to write without errors, and even harder to maintain.

Fortunately, much more sophisticated manipulation is possible, and it is relatively easy to detect, extract, and replace parts of strings that match a pattern. R has a suite of built-in functions for handling these tasks, (loosely) based upon the Unix grep tool. They accept a string to manipulate and a regular expression to match. As mentioned in Chapter 1, regular expressions are patterns that provide a flexible means of describing the contents of a string. They are very useful for matching complex string-data types like phone numbers or email addresses.[41]

The grep, grepl, and regexpr functions all find strings that match a pattern, and sub and gsub replace matching strings. In classic R style, these functions are meticulously correct and very powerful, but suffer from funny naming conventions, quirky argument ordering, and odd return values that have arisen for historical reasons. Fortunately, in the same way that plyr provides a consistent wrapper around apply functions and lubridate provides a consistent wrapper around the date-time functions, the stringr package provides a consistent wrapper around the string manipulation functions. The difference is that while you will occasionally need to use a base apply or date-time function, stringr is advanced enough that you shouldn’t need to bother with grep at all. So, take a look at the ?grep help page, but don’t devote too much of your brain to it.

These next examples use the english_monarchs dataset from the learningr package. It contains the names and dates of rulers from post-Roman times (in the fifth century CE), when England was split into seven regions known as the heptarchy, until England took over Ireland in the early thirteenth century:

data(english_monarchs, package = "learningr")
head(english_monarchs)
##       name     house start.of.reign end.of.reign      domain
## 1    Wehha Wuffingas             NA          571 East Anglia
## 2    Wuffa Wuffingas            571          578 East Anglia
## 3   Tytila Wuffingas            578          616 East Anglia
## 4  Rædwald Wuffingas            616          627 East Anglia
## 5 Eorpwald Wuffingas            627          627 East Anglia
## 6 Ricberht Wuffingas            627          630 East Anglia
##   length.of.reign.years reign.was.more.than.30.years
## 1                    NA                           NA
## 2                     7                        FALSE
## 3                    38                         TRUE
## 4                    11                        FALSE
## 5                     0                        FALSE
## 6                     3                        FALSE

One of the problems with history is that there is an awful lot of it. Fortunately, odd or messy data can be a really good indicator of the interesting bits of history, so we can narrow it down to the good stuff. For example, although there were seven territories that came together to form England, their boundaries were far from fixed, and sometimes one kingdom would conquer another. We can find these convergences by searching for commas in the domain column. To detect a pattern, we use the str_detect function. The fixed function tells str_detect that we are looking for a fixed string (a comma) rather than a regular expression. str_detect returns a logical vector that we can use for an index:

library(stringr)
multiple_kingdoms <- str_detect(english_monarchs$domain, fixed(","))
english_monarchs[multiple_kingdoms, c("name", "domain")]
##                           name                    domain
## 17                        Offa       East Anglia, Mercia
## 18                        Offa East Anglia, Kent, Mercia
## 19           Offa and Ecgfrith East Anglia, Kent, Mercia
## 20                    Ecgfrith East Anglia, Kent, Mercia
## 22                     Cœnwulf East Anglia, Kent, Mercia
## 23        Cœnwulf and Cynehelm East Anglia, Kent, Mercia
## 24                     Cœnwulf East Anglia, Kent, Mercia
## 25                    Ceolwulf East Anglia, Kent, Mercia
## 26                   Beornwulf       East Anglia, Mercia
## 82      Ecgbehrt and Æthelwulf              Kent, Wessex
## 83      Ecgbehrt and Æthelwulf      Kent, Mercia, Wessex
## 84      Ecgbehrt and Æthelwulf              Kent, Wessex
## 85    Æthelwulf and Æðelstan I              Kent, Wessex
## 86                   Æthelwulf              Kent, Wessex
## 87 Æthelwulf and Æðelberht III              Kent, Wessex
## 88               Æðelberht III              Kent, Wessex
## 89                  Æthelred I              Kent, Wessex
## 95                       Oswiu       Mercia, Northumbria

Similarly, it was quite common for power over a kingdom to be shared between several people, rather than having a single ruler. (This was especially common when a powerful king had several sons.) We can find these instances by looking for either a comma or the word “and” in the name column. This time, since we are looking for two things, it is easier to specify a regular expression rather than a fixed string. The pipe character, |, has the same meaning in regular expressions as it does in R: it means “or.”

In this next example, to prevent excessive output we just return the name column and ignore missing values (with is.na):

multiple_rulers <- str_detect(english_monarchs$name, ",|and")
english_monarchs$name[multiple_rulers & !is.na(multiple_rulers)]
##  [1] Sigeberht and Ecgric
##  [2] Hun, Beonna and Alberht
##  [3] Offa and Ecgfrith
##  [4] Cœnwulf and Cynehelm
##  [5] Sighere and Sebbi
##  [6] Sigeheard and Swaefred
##  [7] Eorcenberht and Eormenred
##  [8] Oswine, Swæfbehrt, Swæfheard
##  [9] Swæfbehrt, Swæfheard, Wihtred
## [10] Æðelberht II, Ælfric and Eadberht I
## [11] Æðelberht II and Eardwulf
## [12] Eadberht II, Eanmund and Sigered
## [13] Heaberht and Ecgbehrt II
## [14] Ecgbehrt and Æthelwulf
## [15] Ecgbehrt and Æthelwulf
## [16] Ecgbehrt and Æthelwulf
## [17] Æthelwulf and Æðelstan I
## [18] Æthelwulf and Æðelberht III
## [19] Penda and Eowa
## [20] Penda and Peada
## [21] Æthelred, Lord of the Mercians
## [22] Æthelflæd, Lady of the Mercians
## [23] Ælfwynn, Second Lady of the Mercians
## [24] Hálfdan and Eowils
## [25] Noðhelm and Watt
## [26] Noðhelm and Bryni
## [27] Noðhelm and Osric
## [28] Noðhelm and Æðelstan
## [29] Ælfwald, Oslac and Osmund
## [30] Ælfwald, Ealdwulf, Oslac and Osmund
## [31] Ælfwald, Ealdwulf, Oslac, Osmund and Oswald
## [32] Cenwalh and Seaxburh
## 211 Levels: Adda Æðelbehrt Æðelberht I ... Wulfhere

If we wanted to split the name column into individual rulers, then we could use str_split (or strsplit from base R, which does the same thing) in much the same way. str_split accepts a vector and returns a list, since each input string can be split into a vector of possibly differing lengths. If each input must return the same number of splits, we could use str_split_fixed instead, which returns a matrix. The output shows the first few examples of multiple rulers:

individual_rulers <- str_split(english_monarchs$name, ", | and ")
head(individual_rulers[sapply(individual_rulers, length) > 1])
## [[1]]
## [1] "Sigeberht" "Ecgric"
##
## [[2]]
## [1] "Hun"     "Beonna"  "Alberht"
##
## [[3]]
## [1] "Offa"     "Ecgfrith"
##
## [[4]]
## [1] "Cœnwulf"  "Cynehelm"
##
## [[5]]
## [1] "Sighere" "Sebbi"
##
## [[6]]
## [1] "Sigeheard" "Swaefred"

Many of the Anglo-Saxon rulers during this period had Old English characters in their names, like “æ” (“ash”), which represents “ae,” or “ð” and “þ” (“eth” and “thorn,” respectively), which both represent “th.” The exact spelling of each ruler’s name isn’t standardized in many cases, but to identify a particular ruler it is necessary to be consistent.

Let’s take a look at how many times th, ð, and þ are used to form the letters “th.” We can count the number of times each one occurs in each name using str_count, then sum over all rulers to calculate the total number of occurrences:

th <- c("th", "ð", "þ
sapply(         #can also use laply from plyr
  th,
  function(th)
  {
    sum(str_count(english_monarchs$name, th))
  }
)
## th  ð  þ
## 74 26  7

It looks like the standard modern Latin spelling is most common in this dataset. If we want to replace the eths and thorns, we can use str_replace_all. (A variant function, str_replace, replaces only the first match.) Placing eth and thorn in square brackets means “match either of these characters”:

english_monarchs$new_name <- str_replace_all(english_monarchs$name, "[ðþ]", "th")

This sort of trick can be very useful for cleaning up levels of a categorical variable. For example, genders can be specified in several ways in English, but we usually only want two of them. In the next example, we match on a string that starts with (^) “m” and is followed by an optional (?) “ale”, which ends the string ($):

gender <- c(
  "MALE", "Male", "male", "M", "FEMALE",
  "Female", "female", "f", NA
)
clean_gender <- str_replace(
  gender,
  ignore.case("^m(ale)?$"),
  "Male"
)
(clean_gender <- str_replace(
  clean_gender,
  ignore.case("^f(emale)?$"),
  "Female"
))
## [1] "Male"   "Male"   "Male"   "Male"   "Female" "Female" "Female" "Female"
## [9] NA

Manipulating Data Frames

Much of the task of cleaning data involves manipulating data frames to get them into the desired form. We’ve already seen indexing and the subset function for selecting a subset of a data frame. Other common tasks include augmenting a data frame with additional columns (or replacing existing columns), dealing with missing values, and converting between the wide and long forms of a data frame. There are several functions available for adding or replacing columns in a data frame.

Adding and Replacing Columns

Suppose we want to add a column to the english_monarchs data frame denoting the number of years the rulers were in power. We can use standard assignment to achieve this:

english_monarchs$length.of.reign.years <-
  english_monarchs$end.of.reign - english_monarchs$start.of.reign

This works, but the repetition of the data frame variable names makes this a lot of effort to type and to read. The with function makes things easier by letting you call variables directly. It takes a data frame[42] and an expression to evaluate:

english_monarchs$length.of.reign.years <- with(
  english_monarchs,
  end.of.reign - start.of.reign
)

The within function works in a similar way, but returns the whole data frame:

english_monarchs <- within(
  english_monarchs,
  {
    length.of.reign.years <- end.of.reign - start.of.reign
  }
)

Although within requires more effort in this example, it becomes more useful if you want to change multiple columns:

english_monarchs <- within(
  english_monarchs,
  {
    length.of.reign.years <- end.of.reign - start.of.reign
    reign.was.more.than.30.years <- length.of.reign.years > 30
  }
)

A good heuristic is that if you are creating or changing one column, then use with; if you want to manipulate several columns at once, then use within.

An alternative approach is taken by the mutate function in the plyr package, which accepts new and revised columns as name-value pairs:[43]

english_monarchs <- mutate(
  english_monarchs,
  length.of.reign.years        = end.of.reign - start.of.reign,
  reign.was.more.than.30.years = length.of.reign.years > 30
)

Dealing with Missing Values

The red deer dataset that we saw in the previous chapter contains measurements of the endocranial volume for each deer using four different techniques. For some but not all of the deer, a second measurement was taken to test the repeatability of the technique. This means that some of the rows have missing values. The complete.cases function tells us which rows are free of missing values:

data("deer_endocranial_volume", package = "learningr")
has_all_measurements <- complete.cases(deer_endocranial_volume)
deer_endocranial_volume[has_all_measurements, ]
##    SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2
## 7     C120   346     335   1250          289    346      330    1264
## 8      C25   302     295   1011          250    303      295    1009
## 9       F7   379     360   1621          347    375      365    1647
## 10     B12   410     400   1740          387    413      395    1728
## 11     B17   405     395   1652          356    408      395    1639
## 12     B18   391     370   1835          419    394      375    1825
## 13      J7   416     405   1834          408    417      405    1876
## 15      A4   336     330   1224          283    345      330    1192
## 20      K2   349     355   1239          286    354      365    1243

The na.omit function provides a shortcut to this, removing any rows of a data frame where there are missing values:[44]

na.omit(deer_endocranial_volume)
##    SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2
## 7     C120   346     335   1250          289    346      330    1264
## 8      C25   302     295   1011          250    303      295    1009
## 9       F7   379     360   1621          347    375      365    1647
## 10     B12   410     400   1740          387    413      395    1728
## 11     B17   405     395   1652          356    408      395    1639
## 12     B18   391     370   1835          419    394      375    1825
## 13      J7   416     405   1834          408    417      405    1876
## 15      A4   336     330   1224          283    345      330    1192
## 20      K2   349     355   1239          286    354      365    1243

By contrast, na.fail will throw an error if your data frame contains any missing values:

na.fail(deer_endocranial_volume)

Both these functions can accept vectors as well, removing missing values or failing, as in the data frame case.

Note

You can use multiple imputation to fill in missing values in a statistically sound way. This is beyond the scope of the book, but the mice and mix packages are good places to start.

Converting Between Wide and Long Form

The red deer dataset contains measurements of the volume of deer skulls obtained in four different ways. Each measurement for a particular deer is given in its own column. (For simplicity, let’s ignore the columns for repeat measurements.) This is known as the wide form of a data frame:

deer_wide <- deer_endocranial_volume[, 1:5]

An alternative point of view is that each skull measurement is the same type of thing (a measurement), just a different measurement. So, a different way of representing the data would be to have four rows for each deer, with a column for the skull ID, as before (so each value would be repeated four times), a column containing all the measurements, and a factor column explaining what type of measurement is contained in that particular row. This is called the long form of a data frame.

There is a function in base R for converting between wide and long form, called reshape. It’s very powerful, but not entirely intuitive; a better alternative is to use the functionality of the reshape2 package.

The melt function available in this package converts from wide form to long.[45] We choose SkullID as the ID column (with everything else being classed as a measurement):

library(reshape2)
deer_long <- melt(deer_wide, id.vars = "SkullID")
head(deer_long)
##   SkullID variable value
## 1   DIC44    VolCT   389
## 2     B11    VolCT   389
## 3   DIC90    VolCT   352
## 4   DIC83    VolCT   388
## 5  DIC787    VolCT   375
## 6 DIC1573    VolCT   325

You can, alternatively, supply the measure.vars argument, which is all the columns that aren’t included in id.vars. In this case it is more work, but it can be useful if you have many ID variables and few measurement variables:

melt(deer_wide, measure.vars = c("VolCT", "VolBead", "VolLWH", "VolFinarelli"))

The dcast function converts back from long to wide and returns the result as a data frame (the related function acast returns a vector, matrix, or array):

deer_wide_again <- dcast(deer_long, SkullID ~ variable)

Our reconstituted dataset, deer_wide_again, is identical to the original, deer_wide, except that it is now ordered alphabetically by SkullID.

Note

Spreadsheet aficionados might note that acast and dcast are effectively creating pivot tables.

Using SQL

The sqldf package provides a way of manipulating data frames using SQL. In general, native R functions are more concise and readable than SQL code, but if you come from a database background this package can ease your transition to R:

install.packages("sqldf")

The next example compares the native R and sqldf versions of a subsetting query:

library(sqldf)
## Loading required package: DBI
## Loading required package: gsubfn
## Loading required package: proto
## Loading required namespace: tcltk
## Loading required package: chron
## Loading required package: RSQLite
## Loading required package: RSQLite.extfuns
subset(
  deer_endocranial_volume,
  VolCT > 400 | VolCT2 > 400,
  c(VolCT, VolCT2)
)
##    VolCT VolCT2
## 10   410    413
## 11   405    408
## 13   416    417
## 16   418     NA
query <-
  "SELECT
      VolCT,
      VolCT2
    FROM
      deer_endocranial_volume
    WHERE
      VolCT > 400 OR
      VolCT2 > 400"
sqldf(query)
## Loading required package: tcltk
##   VolCT VolCT2
## 1   410    413
## 2   405    408
## 3   416    417
## 4   418     NA

Sorting

It’s often useful to have numeric data in size order, since the interesting values are often the extremes. The sort function sorts vectors from smallest to largest (or largest to smallest):[46]

x <- c(2, 32, 4, 16, 8)
sort(x)
## [1]  2  4  8 16 32
sort(x, decreasing = TRUE)
## [1] 32 16  8  4  2

Strings can also be sorted, but the sort order depends upon locale. Usually letters are ordered from “a” through to “z,” but there are oddities: in Estonian, “z” comes after “s” and before “t,” for example. More of these quirks are listed in the ?Comparison help page. In an English or North American locale, you’ll see results like this:

sort(c("I", "shot", "the", "city", "sheriff"))
## [1] "city"    "I"       "sheriff" "shot"    "the"

The order function is a kind of inverse to sort. The ith element of the order contains the index of the element of x that will end up in the ith position after sorting. That takes a bit of getting your head around, but mostly what you need to know is that x[order(x)] returns the same result as sort(x):

order(x)
## [1] 1 3 5 4 2
x[order(x)]
## [1]  2  4  8 16 32
identical(sort(x), x[order(x)])
## [1] TRUE

order is most useful for sorting data frames, where sort cannot be used directly. For example, to sort the english_monarchs data frame by the year of the start of the reign, we can use:

year_order <- order(english_monarchs$start.of.reign)
english_monarchs[year_order, ]

The arrange function from the plyr package provides a one-line alternative for ordering data frames:

arrange(english_monarchs, start.of.reign)

The rank function gives the rank of each element in a dataset, providing a few ways of dealing with ties:

(x <- sample(3, 7, replace = TRUE))
## [1] 1 2 1 3 3 3 2
rank(x)
## [1] 1.5 3.5 1.5 6.0 6.0 6.0 3.5
rank(x, ties.method = "first")
## [1] 1 3 2 5 6 7 4

Functional Programming

Several concepts from functional programming languages like LISP and Haskell have been introduced into R. You don’t need to know anything at all about functional programming to use them;[47] you just need to know that these functions can be useful for manipulating data.

The Negate function accepts a predicate (that is, a function that returns a logical vector), and returns another predicate that does the opposite.[48] It returns TRUE when the input returns FALSE and FALSE when the input returns TRUE:

ct2 <- deer_endocranial_volume$VolCT2  #for convenience of typing
isnt.na <- Negate(is.na)
identical(isnt.na(ct2), !is.na(ct2))
## [1] TRUE

Filter takes a function that returns a logical vector and an input vector, and returns only those values where the function returns TRUE:

Filter(isnt.na, ct2)
## [1] 346 303 375 413 408 394 417 345 354

The Position function behaves a little bit like which, which we saw in Vectors in Chapter 4. It returns the first index where applying a predicate to a vector returns TRUE:

Position(isnt.na, ct2)
## [1] 7

Find is similar to Position, but it returns the first value rather than the first index:

Find(isnt.na, ct2)
## [1] 346

Map applies a function element-wise to its inputs. It’s just a wrapper to mapply, with SIMPLIFY = FALSE. In this next example, we retrieve the average measurement using each method for each deer in the red deer dataset. First, we need a function to pass to Map to find the volume of each deer skull:

get_volume <- function(ct, bead, lwh, finarelli, ct2, bead2, lwh2)
{
  #If there is a second measurement, take the average
  if(!is.na(ct2))
  {
    ct <- (ct + ct2) / 2
    bead <- (bead + bead2) / 2
    lwh <- (lwh + lwh2) / 2
  }
  #Divide lwh by 4 to bring it in line with other measurements
  c(ct = ct, bead = bead, lwh.4 = lwh / 4, finarelli = finarelli)
}

Then Map behaves like mapply—it takes a function and then each argument to pass to that function:

measurements_by_deer <- with(
  deer_endocranial_volume,
  Map(
    get_volume,
    VolCT,
    VolBead,
    VolLWH,
    VolFinarelli,
    VolCT2,
    VolBead2,
    VolLWH2
  )
)
head(measurements_by_deer)
## [[1]]
##        ct      bead     lwh.4 finarelli
##       389       375       371       337
##
## [[2]]
##        ct      bead     lwh.4 finarelli
##     389.0     370.0     430.5     377.0
##
## [[3]]
##        ct      bead     lwh.4 finarelli
##     352.0     345.0     373.8     328.0
##
## [[4]]
##        ct      bead     lwh.4 finarelli
##     388.0     370.0     420.8     377.0
##
## [[5]]
##        ct      bead     lwh.4 finarelli
##     375.0     355.0     364.5     328.0
##
## [[6]]
##        ct      bead     lwh.4 finarelli
##     325.0     320.0     340.8     291.0

The Reduce function turns a binary function into one that accepts multiple inputs. For example, the + operator calculates the sum of two numbers, but the sum function calculates the sum of multiple inputs. sum(a, b, c, d, e) is (roughly) equivalent to Reduce("+", list(a, b, c, d, e)).

We can define a simple binary function that calculates the (parallel) maximum of two inputs:

pmax2 <- function(x, y) ifelse(x >= y, x, y)

If we reduce this function, then it will accept a list of many inputs (like the pmax function in base R does):

Reduce(pmax2, measurements_by_deer)
##        ct      bead     lwh.4 finarelli
##     418.0     405.0     463.8     419.0

One proviso is that Reduce repeatedly calls the binary function on pairs of inputs, so:

Reduce("+", list(a, b, c, d, e))

is the same as:

((((a + b) + c) + d) + e)

This means that you can’t use it for something like calculating the mean, since:

mean(mean(mean(mean(a, b), c), d), e) != mean(a, b, c, d, e)

Summary

  • The stringr package is useful for manipulating strings.
  • Columns of a data frame can be added, subtracted, or manipulated.
  • Data frames can exist in wide or long form.
  • Vectors can be sorted, ranked, and ordered.
  • R has some functional programming capabilities, including Map and Reduce.

Test Your Knowledge: Quiz

Question 13-1
How would you count the number of times the word “thou” appears in Shakespeare’s The Tempest?
Question 13-2
Name as many functions as you can think of for adding columns to data frames.
Question 13-3
What is the opposite of melting?
Question 13-4
How would you reorder a data frame by one if its columns?
Question 13-5
How would you find the first positive number in a vector?

Test Your Knowledge: Exercises

Exercise 13-1
  1. Load the hafu dataset from the learningr package. In the Father and Mother columns, some values have question marks after the country name, indicating that the author was uncertain about the nationality of the parent. Create two new columns in the hafu data frame, denoting whether or not there was a question mark in the Father or Mother column, respectively.
  2. Remove those question marks from the Father and Mother columns. [10]
Exercise 13-2
The hafu dataset has separate columns for the nationality of each parent. Convert the data frame from wide form to long form, with a single column for the parents’ nationality and a column indicating which parent the nationality refers to. [5]
Exercise 13-3
Write a function that returns the 10 most common values in a vector, along with their counts. Try the function on some columns from the hafu dataset. [10]


[41] See the assertive package for some pre-canned regular expressions for this purpose.

[42] Or an environment.

[43] The transform function from base R is a precursor to mutate, and is now obsolete.

[44] na.exclude does the same thing as na.omit; their dual existence is mostly for legacy purposes.

[45] The terminology of “melting” and its opposite, “casting,” is borrowed from the steel industry.

[46] Sorting algorithm geeks may like to note that the sort function uses a shellsort by default, but you can use method = "quick" to use a quicksort instead. Radix sorting is also available for factors.

[47] For the curious, there’s a good introduction at wordIQ.com.

[48] Technically, a predicate is a function that returns a single logical value, so we’re abusing the term, but the word for the vector equivalent hasn’t been coined yet. I rather like the Schwarzenegger-esque “predicator,” but until that catches on, we’ll use plain old predicate.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset