No matter what format you are given data in, it’s almost always the wrong one for what you want to do with it, and no matter who gave it to you, it’s almost always dirty. Cleaning and transforming data may not be the fun part of data analysis, but you’ll probably spend more of your life than you care to doing it. Fortunately, R has a wide selection of tools to help with these tasks.
After reading this chapter, you should:
Back in Chapter 7, we looked at some simple string manipulation tasks like combining strings together using paste
, and extracting sections of a string using substring
.
One really common problem is when logical values have been encoded in a way that R doesn’t understand. In the alpe_d_huez
cycling dataset, the DrugUse
column (denoting whether or not allegations of drug use have been made about each rider’s performance), values have been encoded as "Y"
and "N"
rather than TRUE
or FALSE
. For this sort of simple matching, we can directly replace each string with the correct logical value:
yn_to_logical<-
function
(
x)
{
y<-
rep.int(
NA
,
length(
x))
y[
x==
"Y"
]
<-
TRUE
y[
x==
"N"
]
<-
FALSE
y}
Setting values to NA
by default lets us deal with strings that don’t match "Y"
or "N"
. We call the function in the obvious way:
alpe_d_huez$
DrugUse<-
yn_to_logical(
alpe_d_huez$
DrugUse)
This direct replacement of one string with another doesn’t scale very well to having lots of choices of string. If you have ten thousand possible inputs, then a function to change each one would be very hard to write without errors, and even harder to maintain.
Fortunately, much more sophisticated manipulation is possible, and it is relatively easy to detect, extract, and replace parts of strings that match a pattern. R has a suite of built-in functions for handling these tasks, (loosely) based upon the Unix grep
tool. They accept a string to manipulate and a regular expression to match. As mentioned in Chapter 1, regular expressions are patterns that provide a flexible means of describing the contents of a string. They are very useful for matching complex string-data types like phone numbers or email addresses.[41]
The grep
, grepl
, and regexpr
functions all find strings that match a pattern, and sub
and gsub
replace matching strings. In classic R style, these functions are meticulously correct and very powerful, but suffer from funny naming conventions, quirky argument ordering, and odd return values that have arisen for historical reasons. Fortunately, in the same way that plyr
provides a consistent wrapper around apply
functions and lubridate
provides a consistent wrapper around the date-time functions, the stringr
package provides a consistent wrapper around the string manipulation functions. The difference is that while you will occasionally need to use a base apply
or date-time function, stringr
is advanced enough that you shouldn’t need to bother with grep
at all. So, take a look at the ?grep
help page, but don’t devote too much of your brain to it.
These next examples use the english_monarchs
dataset from the learningr
package. It contains the names and dates of rulers from post-Roman times (in the fifth century CE), when England was split into seven regions known as the heptarchy, until England took over Ireland in the early thirteenth century:
data(
english_monarchs,
package=
"learningr"
)
head(
english_monarchs)
## name house start.of.reign end.of.reign domain ## 1 Wehha Wuffingas NA 571 East Anglia ## 2 Wuffa Wuffingas 571 578 East Anglia ## 3 Tytila Wuffingas 578 616 East Anglia ## 4 Rædwald Wuffingas 616 627 East Anglia ## 5 Eorpwald Wuffingas 627 627 East Anglia ## 6 Ricberht Wuffingas 627 630 East Anglia ## length.of.reign.years reign.was.more.than.30.years ## 1 NA NA ## 2 7 FALSE ## 3 38 TRUE ## 4 11 FALSE ## 5 0 FALSE ## 6 3 FALSE
One of the problems with history is that there is an awful lot of it. Fortunately, odd or messy data can be a really good indicator of the interesting bits of history, so we can narrow it down to the good stuff. For example, although there were seven territories that came together to form England, their boundaries were far from fixed, and sometimes one kingdom would conquer another. We can find these convergences by searching for commas in the domain
column. To detect a pattern, we use the str_detect
function. The fixed
function tells str_detect
that we are looking for a fixed string (a comma) rather than a regular expression. str_detect
returns a logical vector that we can use for an index:
library(
stringr)
multiple_kingdoms<-
str_detect(
english_monarchs$
domain,
fixed(
","
))
english_monarchs[
multiple_kingdoms,
c(
"name"
,
"domain"
)]
## name domain ## 17 Offa East Anglia, Mercia ## 18 Offa East Anglia, Kent, Mercia ## 19 Offa and Ecgfrith East Anglia, Kent, Mercia ## 20 Ecgfrith East Anglia, Kent, Mercia ## 22 Cœnwulf East Anglia, Kent, Mercia ## 23 Cœnwulf and Cynehelm East Anglia, Kent, Mercia ## 24 Cœnwulf East Anglia, Kent, Mercia ## 25 Ceolwulf East Anglia, Kent, Mercia ## 26 Beornwulf East Anglia, Mercia ## 82 Ecgbehrt and Æthelwulf Kent, Wessex ## 83 Ecgbehrt and Æthelwulf Kent, Mercia, Wessex ## 84 Ecgbehrt and Æthelwulf Kent, Wessex ## 85 Æthelwulf and Æðelstan I Kent, Wessex ## 86 Æthelwulf Kent, Wessex ## 87 Æthelwulf and Æðelberht III Kent, Wessex ## 88 Æðelberht III Kent, Wessex ## 89 Æthelred I Kent, Wessex ## 95 Oswiu Mercia, Northumbria
Similarly, it was quite common for power over a kingdom to be shared between several people, rather than having a single ruler. (This was especially common when a powerful king had several sons.) We can find these instances by looking for either a comma or the word “and” in the name
column. This time, since we are looking for two things, it is easier to specify a regular expression rather than a fixed string. The pipe character, |
, has the same meaning in regular expressions as it does in R: it means “or.”
In this next example, to prevent excessive output we just return the name
column and ignore missing values (with is.na
):
multiple_rulers<-
str_detect(
english_monarchs$
name,
",|and"
)
english_monarchs$
name[
multiple_rulers&
!is.na(
multiple_rulers)]
## [1] Sigeberht and Ecgric ## [2] Hun, Beonna and Alberht ## [3] Offa and Ecgfrith ## [4] Cœnwulf and Cynehelm ## [5] Sighere and Sebbi ## [6] Sigeheard and Swaefred ## [7] Eorcenberht and Eormenred ## [8] Oswine, Swæfbehrt, Swæfheard ## [9] Swæfbehrt, Swæfheard, Wihtred ## [10] Æðelberht II, Ælfric and Eadberht I ## [11] Æðelberht II and Eardwulf ## [12] Eadberht II, Eanmund and Sigered ## [13] Heaberht and Ecgbehrt II ## [14] Ecgbehrt and Æthelwulf ## [15] Ecgbehrt and Æthelwulf ## [16] Ecgbehrt and Æthelwulf ## [17] Æthelwulf and Æðelstan I ## [18] Æthelwulf and Æðelberht III ## [19] Penda and Eowa ## [20] Penda and Peada ## [21] Æthelred, Lord of the Mercians ## [22] Æthelflæd, Lady of the Mercians ## [23] Ælfwynn, Second Lady of the Mercians ## [24] Hálfdan and Eowils ## [25] Noðhelm and Watt ## [26] Noðhelm and Bryni ## [27] Noðhelm and Osric ## [28] Noðhelm and Æðelstan ## [29] Ælfwald, Oslac and Osmund ## [30] Ælfwald, Ealdwulf, Oslac and Osmund ## [31] Ælfwald, Ealdwulf, Oslac, Osmund and Oswald ## [32] Cenwalh and Seaxburh ## 211 Levels: Adda Æðelbehrt Æðelberht I ... Wulfhere
If we wanted to split the name
column into individual rulers, then we could use str_split
(or strsplit
from base R, which does the same thing) in much the same way. str_split
accepts a vector and returns a list, since each input string can be split into a vector of possibly differing lengths. If each input must return the same number of splits, we could use str_split_fixed
instead, which returns a matrix. The output shows the first few examples of multiple rulers:
individual_rulers<-
str_split(
english_monarchs$
name,
", | and "
)
head(
individual_rulers[
sapply(
individual_rulers,
length)
>
1
])
## [[1]] ## [1] "Sigeberht" "Ecgric" ## ## [[2]] ## [1] "Hun" "Beonna" "Alberht" ## ## [[3]] ## [1] "Offa" "Ecgfrith" ## ## [[4]] ## [1] "Cœnwulf" "Cynehelm" ## ## [[5]] ## [1] "Sighere" "Sebbi" ## ## [[6]] ## [1] "Sigeheard" "Swaefred"
Many of the Anglo-Saxon rulers during this period had Old English characters in their names, like “æ” (“ash”), which represents “ae,” or “ð” and “þ” (“eth” and “thorn,” respectively), which both represent “th.” The exact spelling of each ruler’s name isn’t standardized in many cases, but to identify a particular ruler it is necessary to be consistent.
Let’s take a look at how many times th
, ð
, and þ
are used to form the letters “th.” We can count the number of times each one occurs in each name using str_count
, then sum
over all rulers to calculate the total number of occurrences:
th<-
c(
"th"
,
"ð"
,
"
þ
sapply(
#can also use laply from plyr
th,
function
(
th)
{
sum(
str_count(
english_monarchs$
name,
th))
}
)
## th ð þ ## 74 26 7
It looks like the standard modern Latin spelling is most common in this dataset. If we want to replace the eths and thorns, we can use str_replace_all
. (A variant function, str_replace
, replaces only the first match.) Placing eth and thorn in square brackets means “match either of these characters”:
english_monarchs$
new_name<-
str_replace_all(
english_monarchs$
name,
"[ðþ]"
,
"th"
)
This sort of trick can be very useful for cleaning up levels of a categorical variable. For example, genders can be specified in several ways in English, but we usually only want two of them. In the next example, we match on a string that starts with (^
) “m” and is followed by an optional (?
) “ale”, which ends the string ($
):
gender<-
c(
"MALE"
,
"Male"
,
"male"
,
"M"
,
"FEMALE"
,
"Female"
,
"female"
,
"f"
,
NA
)
clean_gender<-
str_replace(
gender,
ignore.case(
"^m(ale)?$"
),
"Male"
)
(
clean_gender<-
str_replace(
clean_gender,
ignore.case(
"^f(emale)?$"
),
"Female"
))
## [1] "Male" "Male" "Male" "Male" "Female" "Female" "Female" "Female" ## [9] NA
Much of the task of cleaning data involves manipulating data frames to get them into the desired form. We’ve already seen indexing and the subset
function for selecting a subset of a data frame. Other common tasks include augmenting a data frame with additional columns (or replacing existing columns), dealing with missing values, and converting between the wide and long forms of a data frame. There are several functions available for adding or replacing columns in a data frame.
Suppose we want to add a column to the english_monarchs
data frame denoting the number of years the rulers were in power. We can use standard assignment to achieve this:
english_monarchs$
length.of.reign.years<-
english_monarchs$
end.of.reign-
english_monarchs$
start.of.reign
This works, but the repetition of the data frame variable names makes this a lot of effort to type and to read. The with
function makes things easier by letting you call variables directly. It takes a data frame[42] and an expression to evaluate:
english_monarchs$
length.of.reign.years<-
with(
english_monarchs,
end.of.reign-
start.of.reign)
The within
function works in a similar way, but returns the whole data frame:
english_monarchs<-
within(
english_monarchs,
{
length.of.reign.years<-
end.of.reign-
start.of.reign}
)
Although within
requires more effort in this example, it becomes more useful if you want to change multiple columns:
english_monarchs<-
within(
english_monarchs,
{
length.of.reign.years<-
end.of.reign-
start.of.reign reign.was.more.than.30.years<-
length.of.reign.years>
30
}
)
A good heuristic is that if you are creating or changing one column, then use with
; if you want to manipulate several columns at once, then use within
.
An alternative approach is taken by the mutate
function in the plyr
package, which accepts new and revised columns as name-value pairs:[43]
english_monarchs<-
mutate(
english_monarchs,
length.of.reign.years=
end.of.reign-
start.of.reign,
reign.was.more.than.30.years=
length.of.reign.years>
30
)
The red deer dataset that we saw in the previous chapter contains measurements of the endocranial volume for each deer using four different techniques. For some but not all of the deer, a second measurement was taken to test the repeatability of the technique. This means that some of the rows have missing values. The complete.cases
function tells us which rows are free of missing values:
data(
"deer_endocranial_volume"
,
package=
"learningr"
)
has_all_measurements<-
complete.cases(
deer_endocranial_volume)
deer_endocranial_volume[
has_all_measurements,
]
## SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2 ## 7 C120 346 335 1250 289 346 330 1264 ## 8 C25 302 295 1011 250 303 295 1009 ## 9 F7 379 360 1621 347 375 365 1647 ## 10 B12 410 400 1740 387 413 395 1728 ## 11 B17 405 395 1652 356 408 395 1639 ## 12 B18 391 370 1835 419 394 375 1825 ## 13 J7 416 405 1834 408 417 405 1876 ## 15 A4 336 330 1224 283 345 330 1192 ## 20 K2 349 355 1239 286 354 365 1243
The na.omit
function provides a shortcut to this, removing any rows of a data frame where there are missing values:[44]
na.omit(
deer_endocranial_volume)
## SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2 ## 7 C120 346 335 1250 289 346 330 1264 ## 8 C25 302 295 1011 250 303 295 1009 ## 9 F7 379 360 1621 347 375 365 1647 ## 10 B12 410 400 1740 387 413 395 1728 ## 11 B17 405 395 1652 356 408 395 1639 ## 12 B18 391 370 1835 419 394 375 1825 ## 13 J7 416 405 1834 408 417 405 1876 ## 15 A4 336 330 1224 283 345 330 1192 ## 20 K2 349 355 1239 286 354 365 1243
By contrast, na.fail
will throw an error if your data frame contains any missing values:
na.fail(
deer_endocranial_volume)
Both these functions can accept vectors as well, removing missing values or failing, as in the data frame case.
You can use multiple imputation to fill in missing values in a statistically sound way. This is beyond the scope of the book, but the mice
and mix
packages are good places to start.
The red deer dataset contains measurements of the volume of deer skulls obtained in four different ways. Each measurement for a particular deer is given in its own column. (For simplicity, let’s ignore the columns for repeat measurements.) This is known as the wide form of a data frame:
deer_wide<-
deer_endocranial_volume[,
1
:5
]
An alternative point of view is that each skull measurement is the same type of thing (a measurement), just a different measurement. So, a different way of representing the data would be to have four rows for each deer, with a column for the skull ID, as before (so each value would be repeated four times), a column containing all the measurements, and a factor column explaining what type of measurement is contained in that particular row. This is called the long form of a data frame.
There is a function in base R for converting between wide and long form, called reshape
. It’s very powerful, but not entirely intuitive; a better alternative is to use the functionality of the reshape2
package.
The melt
function available in this package converts from wide form to long.[45] We choose SkullID
as the ID column (with everything else being classed as a measurement):
library(
reshape2)
deer_long<-
melt(
deer_wide,
id.vars=
"SkullID"
)
head(
deer_long)
## SkullID variable value ## 1 DIC44 VolCT 389 ## 2 B11 VolCT 389 ## 3 DIC90 VolCT 352 ## 4 DIC83 VolCT 388 ## 5 DIC787 VolCT 375 ## 6 DIC1573 VolCT 325
You can, alternatively, supply the measure.vars
argument, which is all the columns that aren’t included in id.vars
. In this case it is more work, but it can be useful if you have many ID variables and few measurement variables:
melt(
deer_wide,
measure.vars=
c(
"VolCT"
,
"VolBead"
,
"VolLWH"
,
"VolFinarelli"
))
The dcast
function converts back from long to wide and returns the result as a data frame (the related function acast
returns a vector, matrix, or array):
deer_wide_again<-
dcast(
deer_long,
SkullID ~ variable)
Our reconstituted dataset, deer_wide_again
, is identical to the original, deer_wide
, except that it is now ordered alphabetically by SkullID
.
Spreadsheet aficionados might note that acast
and dcast
are effectively creating pivot tables.
The sqldf
package provides a way of manipulating data frames using SQL. In general, native R functions are more concise and readable than SQL code, but if you come from a database background this package can ease your transition to R:
install.packages(
"sqldf"
)
The next example compares the native R and sqldf
versions of a subsetting query:
library(
sqldf)
## Loading required package: DBI
## Loading required package: gsubfn
## Loading required package: proto
## Loading required namespace: tcltk
## Loading required package: chron
## Loading required package: RSQLite
## Loading required package: RSQLite.extfuns
subset(
deer_endocranial_volume,
VolCT>
400
|
VolCT2>
400
,
c(
VolCT,
VolCT2)
)
## VolCT VolCT2 ## 10 410 413 ## 11 405 408 ## 13 416 417 ## 16 418 NA
query<-
"SELECT
VolCT,
VolCT2
FROM
deer_endocranial_volume
WHERE
VolCT > 400 OR
VolCT2 > 400"
sqldf(
query)
## Loading required package: tcltk
## VolCT VolCT2 ## 1 410 413 ## 2 405 408 ## 3 416 417 ## 4 418 NA
It’s often useful to have numeric data in size order, since the interesting values are often the extremes. The sort
function sorts vectors from smallest to largest (or largest to smallest):[46]
x<-
c(
2
,
32
,
4
,
16
,
8
)
sort(
x)
## [1] 2 4 8 16 32
sort(
x,
decreasing=
TRUE
)
## [1] 32 16 8 4 2
Strings can also be sorted, but the sort order depends upon locale. Usually letters are ordered from “a” through to “z,” but there are oddities: in Estonian, “z” comes after “s” and before “t,” for example. More of these quirks are listed in the ?Comparison
help page. In an English or North American locale, you’ll see results like this:
sort(
c(
"I"
,
"shot"
,
"the"
,
"city"
,
"sheriff"
))
## [1] "city" "I" "sheriff" "shot" "the"
The order
function is a kind of inverse to sort
. The ith element of the order contains the index of the element of x
that will end up in the ith position after sorting. That takes a bit of getting your head around, but mostly what you need to know is that x[order(x)]
returns the same result as sort(x)
:
order(
x)
## [1] 1 3 5 4 2
x[
order(
x)]
## [1] 2 4 8 16 32
identical(
sort(
x),
x[
order(
x)])
## [1] TRUE
order
is most useful for sorting data frames, where sort
cannot be used directly. For example, to sort the english_monarchs
data frame by the year of the start of the reign, we can use:
year_order<-
order(
english_monarchs$
start.of.reign)
english_monarchs[
year_order,
]
The arrange
function from the plyr
package provides a one-line alternative for ordering data frames:
arrange(
english_monarchs,
start.of.reign)
The rank
function gives the rank of each element in a dataset, providing a few ways of dealing with ties:
(
x<-
sample(
3
,
7
,
replace=
TRUE
))
## [1] 1 2 1 3 3 3 2
rank(
x)
## [1] 1.5 3.5 1.5 6.0 6.0 6.0 3.5
rank(
x,
ties.method=
"first"
)
## [1] 1 3 2 5 6 7 4
Several concepts from functional programming languages like LISP and Haskell have been introduced into R. You don’t need to know anything at all about functional programming to use them;[47] you just need to know that these functions can be useful for manipulating data.
The Negate
function accepts a predicate (that is, a function that returns a logical vector), and returns another predicate that does the opposite.[48] It returns TRUE
when the input returns FALSE
and FALSE
when the input returns TRUE
:
ct2<-
deer_endocranial_volume$
VolCT2#for convenience of typing
isnt.na<-
Negate(
is.na)
identical(
isnt.na(
ct2),
!is.na(
ct2))
## [1] TRUE
Filter
takes a function that returns a logical vector and an input vector, and returns only those values where the function returns TRUE
:
Filter(
isnt.na,
ct2)
## [1] 346 303 375 413 408 394 417 345 354
The Position
function behaves a little bit like which
, which we saw in Vectors in Chapter 4. It returns the first index where applying a predicate to a vector returns TRUE
:
Position(
isnt.na,
ct2)
## [1] 7
Find
is similar to Position
, but it returns the first value rather than the first index:
Find(
isnt.na,
ct2)
## [1] 346
Map
applies a function element-wise to its inputs. It’s just a wrapper to mapply
, with SIMPLIFY = FALSE
. In this next example, we retrieve the average measurement using each method for each deer in the red deer dataset. First, we need a function to pass to Map
to find the volume of each deer skull:
get_volume<-
function
(
ct,
bead,
lwh,
finarelli,
ct2,
bead2,
lwh2)
{
#If there is a second measurement, take the average
if
(
!is.na(
ct2))
{
ct<-
(
ct+
ct2)
/
2
bead<-
(
bead+
bead2)
/
2
lwh<-
(
lwh+
lwh2)
/
2
}
#Divide lwh by 4 to bring it in line with other measurements
c(
ct=
ct,
bead=
bead,
lwh.4=
lwh/
4
,
finarelli=
finarelli)
}
Then Map
behaves like mapply
—it takes a function and then each argument to pass to that function:
measurements_by_deer<-
with(
deer_endocranial_volume,
Map(
get_volume,
VolCT,
VolBead,
VolLWH,
VolFinarelli,
VolCT2,
VolBead2,
VolLWH2)
)
head(
measurements_by_deer)
## [[1]] ## ct bead lwh.4 finarelli ## 389 375 371 337 ## ## [[2]] ## ct bead lwh.4 finarelli ## 389.0 370.0 430.5 377.0 ## ## [[3]] ## ct bead lwh.4 finarelli ## 352.0 345.0 373.8 328.0 ## ## [[4]] ## ct bead lwh.4 finarelli ## 388.0 370.0 420.8 377.0 ## ## [[5]] ## ct bead lwh.4 finarelli ## 375.0 355.0 364.5 328.0 ## ## [[6]] ## ct bead lwh.4 finarelli ## 325.0 320.0 340.8 291.0
The Reduce
function turns a binary function into one that accepts multiple inputs. For example, the +
operator calculates the sum of two numbers, but the sum
function calculates the sum of multiple inputs. sum(a, b, c, d, e)
is (roughly) equivalent to Reduce("+", list(a, b, c, d, e))
.
We can define a simple binary function that calculates the (parallel) maximum of two inputs:
pmax2<-
function
(
x,
y)
ifelse(
x>=
y,
x,
y)
If we reduce this function, then it will accept a list of many inputs (like the pmax
function in base R does):
Reduce(
pmax2,
measurements_by_deer)
## ct bead lwh.4 finarelli ## 418.0 405.0 463.8 419.0
One proviso is that Reduce
repeatedly calls the binary function on pairs of inputs, so:
Reduce(
"+"
,
list(
a,
b,
c,
d,
e))
is the same as:
((((
a+
b)
+
c)
+
d)
+
e)
This means that you can’t use it for something like calculating the mean, since:
mean(
mean(
mean(
mean(
a,
b),
c),
d),
e)
!=
mean(
a,
b,
c,
d,
e)
hafu
dataset from the learningr
package. In the Father
and Mother
columns, some values have question marks after the country name, indicating that the author was uncertain about the nationality of the parent. Create two new columns in the hafu
data frame, denoting whether or not there was a question mark in the Father
or Mother
column, respectively.
Father
and Mother
columns. [10]
hafu
dataset has separate columns for the nationality of each parent. Convert the data frame from wide form to long form, with a single column for the parents’ nationality and a column indicating which parent the nationality refers to. [5]
hafu
dataset. [10]
[41] See the assertive
package for some pre-canned regular expressions for this purpose.
[42] Or an environment.
[43] The transform
function from base R is a precursor to mutate
, and is now obsolete.
[44] na.exclude
does the same thing as na.omit
; their dual existence is mostly for legacy purposes.
[45] The terminology of “melting” and its opposite, “casting,” is borrowed from the steel industry.
[46] Sorting algorithm geeks may like to note that the sort
function uses a shellsort by default, but you can use method = "quick"
to use a quicksort instead. Radix sorting is also available for factors.
[47] For the curious, there’s a good introduction at wordIQ.com.
[48] Technically, a predicate is a function that returns a single logical value, so we’re abusing the term, but the word for the vector equivalent hasn’t been coined yet. I rather like the Schwarzenegger-esque “predicator,” but until that catches on, we’ll use plain old predicate.