Hour 11. Data Manipulation and Transformation


What You’ll Learn in This Hour:

- Sorting
- Setting and merging
- Handling duplicate values
- Restructuring data frames
- Data aggregation


In the previous hour, we walked through a variety of methods for reading data into R as well as exporting it. This included working with flat files, R’s .RData format, databases, and Microsoft Excel. However, reading data into R is only the start of the data analysis workflow. As data scientists and statisticians, we rarely get to control the structure and format of our data. In Hour 5, “Dates, Times, and Factors,” and Hour 6, “Common R Utility Functions,” you saw some useful functions for working with the format of your data. We looked at dates, times, factors, and missing data. We also looked at common functions for working with numeric and character data. Now we will look a little closer at the structure of our data.

Analysts quote all kinds of numbers for the proportion of a data analysis workflow taken up by data manipulation, or “data munging” as it is increasingly known. One thing most people agree on, however, is that it takes more time than it should, and significantly more time than the interesting analysis piece at the end! These days you can make a career out of being an expert data wrangler!

Several approaches to data manipulation in R have evolved over time. In this hour, we start by looking at what could be called “traditional” approaches to the data manipulation tasks of sorting, setting, and merging. We will then look at the popular packages reshape, reshape2, and tidyr for data restructuring. We will then continue the data manipulation theme into Hour 12, “Efficient Data Handling in R,” where we will look deeper at two of the most popular packages for data manipulation and aggregation, data.table and dplyr.

Sorting

In R we are rarely required to sort our data in order to use a particular function. Most functions do it for us if it’s needed. However, if we are calculating cumulative sums, analyzing time series, or if we just want to view our data in a way that makes sense to us, then we will need to sort the data ourselves. Base R contains a function named sort that enables us to easily sort vectors. By default, the function sorts vectors from low to high, though we can sort in descending order by specifying decreasing = TRUE.

> sort(airquality$Wind)[1:10]
 [1] 1.7 2.3 2.8 3.4 4.0 4.1 4.6 4.6 4.6 4.6
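The decreasing argument mentioned above works as follows; a quick sketch using the same column:

```r
# Sort the Wind column from high to low instead of the default low to high
sort(airquality$Wind, decreasing = TRUE)[1:3]
```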

Unfortunately, the sort function only works with vectors, so it does not help us when we want to sort data frames. To do so, we need to use the order function.

Sorting Data Frames

The order function returns a vector of positions or indices corresponding to the elements we would select if we were to order our data. Let’s create a simple numeric vector, myVec, and examine the output when we feed it to the order function:

> myVec <- c(63, 31, 48, 82, 51, 20, 72, 99, 84, 53)
> order(myVec)
 [1]  6  2  3  5 10  1  7  4  9  8

The first value of the output vector is 6. This tells us that if we were to sort our data from low to high, the first value in the myVec vector that we should select is the sixth value (in this case, the number 20). Next, we should select the second value, which is 31, and so on. The sort order that the order function produces can be used to sort vectors; however, the real benefit is felt when working with data frames. In Listing 11.1 we use order to sort the entire airquality data frame based on the Wind column. The order function is used to select rows in the subscript.

LISTING 11.1 Sorting Data Frames


 1: > sortedByWind <- airquality[order(airquality$Wind), ]
 2: > head(sortedByWind, 10)
 3:     Ozone Solar.R Wind Temp Month Day
 4: 53     NA      59  1.7   76     6  22
 5: 121   118     225  2.3   94     8  29
 6: 126    73     183  2.8   93     9   3
 7: 117   168     238  3.4   81     8  25
 8: 99    122     255  4.0   89     8   7
 9: 62    135     269  4.1   84     7   1
10: 54     NA      91  4.6   76     6  23
11: 66     64     175  4.6   83     7   5
12: 98     66      NA  4.6   87     8   6
13: 127    91     189  4.6   93     9   4


Another benefit of the order function is that it allows us to order data by more than one variable. Looking again at Listing 11.1 we can see that each of the last four printed rows has a Wind value of 4.6. Where two or more values match like this, R uses the original order of the data for the sorting. To instead specify a second ordering variable, we simply have to add the variable as the second argument to order. We can continue to add as many ordering variables as we like in this way.
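For example, to break the Wind ties seen in Listing 11.1 using Temp, we pass Temp as a second argument to order:

```r
# Sort by Wind, breaking ties in Wind using ascending values of Temp
sortedByWindThenTemp <- airquality[order(airquality$Wind, airquality$Temp), ]
head(sortedByWindThenTemp, 4)
```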

Descending Sorts

The order function has a decreasing argument, which, if set to TRUE, sorts from high to low instead of the default low to high. However, this only really helps if we are sorting by a single variable or if all the ordering variables should be sorted from high to low. If we want to be specific about which variables are ascending and which are descending, we accept the default decreasing = FALSE and place a minus sign (-) in front of any variable that requires a descending sort (note that this trick works directly only for numeric variables). An example is shown in Listing 11.2, where the airquality data is sorted by Wind and then by descending values of Temp.

LISTING 11.2 Descending Sorts


 1: > sortedByWindandDescTemp <- airquality[order(airquality$Wind, -airquality$Temp), ]
 2: > head(sortedByWindandDescTemp, 10)
 3:     Ozone Solar.R Wind Temp Month Day
 4: 53     NA      59  1.7   76     6  22
 5: 121   118     225  2.3   94     8  29
 6: 126    73     183  2.8   93     9   3
 7: 117   168     238  3.4   81     8  25
 8: 99    122     255  4.0   89     8   7
 9: 62    135     269  4.1   84     7   1
10: 127    91     189  4.6   93     9   4
11: 98     66      NA  4.6   87     8   6
12: 66     64     175  4.6   83     7   5
13: 54     NA      91  4.6   76     6  23


Appending

Appending, also commonly referred to as combining or setting, typically occurs when data arrive in chunks over a period of time. Each dataset we receive is structurally identical to the last but contains one or more new rows of data. All we need to do is append the new rows to our existing data. In R this can be achieved using the rbind function, which you first saw in action with data frames in Hour 4, “Multi-Mode Data Structures.” To use rbind with data frames, we need to ensure that the column names, and the types of data contained within the columns, match between the two data frames. The rbind function is clever enough to resolve any potential issues with factor levels.

> # New data arrives each month
> jan <- data.frame(Month = "Jan", Value = 46.4)
> feb <- data.frame(Month = "Feb", Value = 55.2)
> rbind(jan, feb)
  Month Value
1   Jan  46.4
2   Feb  55.2
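When many chunks arrive over time, calling rbind once per pair becomes tedious. A common idiom is to collect the chunks in a list and combine them in a single call; here mar is a hypothetical third chunk:

```r
# Hypothetical third chunk, structurally identical to jan and feb
mar <- data.frame(Month = "Mar", Value = 50.1)

# do.call passes every element of the list to rbind at once
allMonths <- do.call(rbind, list(jan, feb, mar))
```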

Merging

For some reason R tends not to be compared favorably with languages such as SAS when it comes to merging, though as a user of both I actually find it slightly easier to merge data in R than in SAS, and it certainly beats Excel! In R there is no need to sort before a merge. In many cases you can even get away without specifying the variable(s) you want to merge by, though it is generally considered bad practice not to do so explicitly. The function we use is merge.

The merge function allows us to merge two datasets by one or more common variables. The function has a number of arguments that can be used to control the “by” variables and match the rows in each dataset. These arguments are listed in Table 11.1.


TABLE 11.1 Arguments to the merge Function

A Merge Example

In order to see the merge function in action, let’s walk through an example using two of the datasets contained within the mangoTraining package, demoData and pkData. The data frames contain data from a fictitious clinical trial in which 33 subjects were given doses of a drug and then monitored over time. First of all, let’s preview the data frames:

> head(demoData, 3)
  Subject Sex Age Weight Height  BMI Smokes
1       1   M  43     57    166 20.7     No
2       2   M  22     71    179 22.2     No
3       3   F  23     72    170 25.1     No
> head(pkData, 7)
  Subject Dose Time   Conc
1       1   25    0   0.00
2       1   25    1 660.13
3       1   25    6 178.92
4       1   25   12  88.99
5       1   25   24  42.71
6       2   25    0   0.00
7       2   25    1 445.55

For each of the 33 subjects in demoData there are five corresponding records in pkData representing times at which blood samples were taken during the fictitious study. In order to model drug concentration, Conc, as a response to Dose and each subject’s demographic information, we would need to create a single data frame containing all relevant information. We do this by merging the two data frames together by the Subject column:

> fullPk <- merge(x = demoData, y = pkData, by = "Subject")

The merge function requires at least an x and a y argument to specify the two data frames that we want to merge. Here, we specified by = "Subject" to make it explicit that we were merging by the common variable Subject. However, because Subject is common to both data frames, we could just as easily have omitted the argument and let R find the common variables to merge by:

> fullPk <- merge(x = demoData, y = pkData)

The arguments by.x and by.y come into play when the name of the variable(s) that we want to merge by differs within the two data frames. The x and y refer to the first two arguments of the function. Therefore, if Subject had been labeled ID in the pkData data frame (our “y” data frame), we would have specified by.x = "Subject", by.y = "ID".
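A minimal sketch with made-up data frames, where the “y” frame calls its subject column ID:

```r
# Made-up data: the "y" frame names the common column ID rather than Subject
demo <- data.frame(Subject = c(1, 2), Age = c(43, 22))
pk   <- data.frame(ID = c(1, 1, 2), Conc = c(0.00, 660.13, 0.00))

# by.x names the column in x; by.y names the matching column in y
merge(x = demo, y = pk, by.x = "Subject", by.y = "ID")
```

Note that the merged column keeps the name used in the “x” data frame, Subject.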

Missing Data

The all, all.x, and all.y arguments control the way in which records are merged when a value of the by variable only appears in one of the two data frames. By default, each of these arguments is set to FALSE, meaning that records will only be merged if the value of the by variable appears in both data frames. In database terminology, this is commonly referred to as an inner join. This is probably best illustrated with an example. Suppose we take tiny subsets of demoData and pkData, keeping only data for the first two subjects in demoData and subjects 2 and 3 in pkData.

> demo1and2 <- demoData[demoData$Subject %in% 1:2, ]
> pk2and3 <- pkData[pkData$Subject %in% 2:3, ]
>
> demo1and2
  Subject Sex Age Weight Height  BMI Smokes
1       1   M  43     57    166 20.7     No
2       2   M  22     71    179 22.2     No
> pk2and3
   Subject Dose Time   Conc
6        2   25    0   0.00
7        2   25    1 445.55
8        2   25    6 129.31
9        2   25   12  93.33
10       2   25   24  46.11
11       3   25    0   0.00
12       3   25    1 500.65
13       3   25    6 146.04
14       3   25   12 116.93
15       3   25   24  68.25

The default behavior of merge only merges data for subject 2 because this is the only subject that appears in both data frames:

> merge(demo1and2, pk2and3)
  Subject Sex Age Weight Height  BMI Smokes Dose Time   Conc
1       2   M  22     71    179 22.2     No   25    0   0.00
2       2   M  22     71    179 22.2     No   25    6 129.31
3       2   M  22     71    179 22.2     No   25   12  93.33
4       2   M  22     71    179 22.2     No   25   24  46.11
5       2   M  22     71    179 22.2     No   25    1 445.55

Specifying all.x = TRUE retains all records in our “x” data (that is, demo1and2), regardless of whether they appear in pk2and3 (a.k.a. a “left join”). Specifying all.y = TRUE does likewise for pk2and3 (a “right join”). An “outer join,” in which all records in each data frame are retained regardless of whether there is a matching value to merge by in the other data frame, is achieved by specifying all = TRUE. An example of an outer join is provided next. Notice that where a value of the merge variable has records only in the “x” data frame, values for all other variables from the “y” data frame are set to NA, and vice versa.

> merge(demo1and2, pk2and3, all = TRUE)
   Subject  Sex Age Weight Height  BMI Smokes Dose Time   Conc
1        1    M  43     57    166 20.7     No   NA   NA     NA
2        2    M  22     71    179 22.2     No   25    0   0.00
3        2    M  22     71    179 22.2     No   25    6 129.31
4        2    M  22     71    179 22.2     No   25   12  93.33
5        2    M  22     71    179 22.2     No   25   24  46.11
6        2    M  22     71    179 22.2     No   25    1 445.55
7        3 <NA>  NA     NA     NA   NA   <NA>   25   12 116.93
8        3 <NA>  NA     NA     NA   NA   <NA>   25    0   0.00
9        3 <NA>  NA     NA     NA   NA   <NA>   25    1 500.65
10       3 <NA>  NA     NA     NA   NA   <NA>   25    6 146.04
11       3 <NA>  NA     NA     NA   NA   <NA>   25   24  68.25


Note: Naming Common Variables

If our two datasets have common variables that we do not wish to merge by, then R will append “.x” and “.y” to the column names in the resulting data frame. The suffixes argument can be used to specify alternative suffixes.
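A small illustration using made-up frames that share a Value column:

```r
# Both frames contain a Value column that we are not merging by
x <- data.frame(ID = 1:2, Value = c(10, 20))
y <- data.frame(ID = 1:2, Value = c(0.1, 0.2))

# Without suffixes this would produce Value.x and Value.y
merge(x, y, by = "ID", suffixes = c(".old", ".new"))
```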


Duplicate Values

The duplicated function finds duplicate values. It does so by asking the question, “Have I seen this before?” For example, take the Month column from the airquality data frame. The airquality data frame contains daily records for five months (May through September). In total there are therefore 153 individual values in the Month column but most are repeats. Calling duplicated on the column yields the following:

> isMonthValueADuplicate <- duplicated(airquality$Month)
> isMonthValueADuplicate[1:10]    # View first 10 records
 [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

The fact that we can generate these TRUE and FALSE values like this is very useful. By placing ! in front of the call to duplicated, we switch the TRUE and FALSE values around. The corresponding logical vector can then be used to remove duplicate values and hence subset our data to leave only the first instance of a value occurring. Here, we use this to extract the first record for each month in the airquality dataset:

> airquality[!duplicated(airquality$Month), ]
    Ozone Solar.R Wind Temp Month Day
1      41     190  7.4   67     5   1
32     NA     286  8.6   78     6   1
62    135     269  4.1   84     7   1
93     39      83  6.9   81     8   1
124    96     167  6.9   91     9   1

Perhaps a more standard use of the duplicated function is to find and remove duplicated records. To achieve this, we can call duplicated directly on a data frame:

> # Create data with a duplicate record for ID==2
> duplicateData <- data.frame(ID = c(1,2,2,3,4), Score = c(57, 45, 45, 63, 54))
> duplicateData
  ID Score
1  1    57
2  2    45
3  2    45
4  3    63
5  4    54
> # Remove the duplicate record
> duplicateData[!duplicated(duplicateData),]
  ID Score
1  1    57
2  2    45
4  3    63
5  4    54


Tip: Unique Values

If we just want to identify the unique values within a vector, the unique function removes all duplicates within a vector and returns a smaller subset containing the unique values.
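For example:

```r
# The distinct months in airquality, in order of first appearance
unique(airquality$Month)
# [1] 5 6 7 8 9
```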


Restructuring

Before we can begin to fit models or even plot our data, we need to ensure that it is in a suitable structure. If it is not, we will need to restructure the data. SAS users would call this transposing the data. Excel users might call it pivoting. Others might call it reshaping or tidying. In R, the best-known and most used packages for restructuring data are reshape, reshape2, and, more recently, tidyr. Each of these packages was written by Hadley Wickham and is based around the notion of what he now refers to as “tidy” data. We can think of the packages as an evolution (beginning with reshape and ending with tidyr): the terminology and usability have improved slightly with each, though the scope of the packages has actually decreased. We will therefore take a little time to look at each package in turn.

Although the term “tidy data” might be unfamiliar, the concept is nothing new. If you are familiar with relational databases, the basic aim is to structure the data as you would in a database table. In other words, we structure the data such that

- Each variable forms a column.

- Each observation forms a row.

This differs from Excel, for which it is common to spread values that we want to compare across multiple columns in order to treat them as separate series when working with Excel’s plotting wizards. The tidy structure is, however, very standard in R, and most of the graphical and analytical packages in R expect a data frame in the tidy format.

Restructuring with reshape

The reshape and reshape2 packages offer essentially the same functionality for restructuring our data. We will work through an example using reshape and highlight differences within reshape2. There are several utility functions contained within the reshape package, but the main restructuring functions are melt, cast, and recast. The basic idea is to “melt” a data frame (using the melt function) into a very long and thin structure and then, if necessary, “cast” it (using the cast function in reshape or dcast in reshape2) into a new structure.


Tip: Getting to Grips with reshape via reshapeGUI

Reshaping data can be hard! The melt and cast functions in reshape are great but can take some getting used to. The reshapeGUI package provides an interactive graphical user interface for practicing using the melt and cast functions. When we use the GUI to select ID and measurement variables, it builds up the equivalent line of R code for us. The GUI also allows us to preview the results before we submit to the R console.


Melting

The trick to understanding the melt function is being able to identify what the package refers to as ID and measurement (“measured”) variables. ID variables represent fixed information about the data collected: usually IDs or names, geographic information about where the data was collected, the date and time the data was collected, and so on. The measurement variables contain the data we have collected. If you consider fitting a model to the data, then as a rough guide the measurement variables would be the response variables and the ID variables would be the explanatory variables.

Once we’ve decided which variables are ID variables and which are measurement variables, we feed them into the respective id.vars and measure.vars arguments. If we specify both arguments, any variables we are not interested in are excluded from the restructuring. To save some typing, we need only specify one of id.vars or measure.vars; R assumes that all remaining variables fall into the other category.

The melt function is best seen through an example. Listing 11.3 shows a simple example using the french_fries data contained within the reshape package. The data was originally collected from a sensory experiment to investigate fryer oils conducted at Iowa State University in 2004.

LISTING 11.3 Melting the french_fries Data


 1: > # Let's begin by loading the package and looking at the data
 2: > library(reshape)
 3: > head(french_fries, 3)
 4:    time treatment subject rep potato buttery grassy rancid painty
 5: 61    1         1       3   1    2.9     0.0      0    0.0    5.5
 6: 25    1         1       3   2   14.0     0.0      0    1.1    0.0
 7: 62    1         1      10   1   11.0     6.4      0    0.0    0.0
 8: > tail(french_fries, 3)
 9:     time treatment subject rep potato buttery grassy rancid painty
10: 695   10         3      78   2    3.3       0      0    2.5    1.4
11: 666   10         3      86   1    2.5       0      0    7.0   10.5
12: 696   10         3      86   2    2.5       0      0    8.2    9.4
13:
14: > # Now we 'melt' having identified the ID variables
15: > fryMelt <- melt(french_fries,
16: +     id.vars = c("time", "treatment", "subject", "rep"))
17:
18: > # Our new data is long and thin
19: > head(fryMelt, 3)
20:   time treatment subject rep variable value
21: 1    1         1       3   1   potato   2.9
22: 2    1         1       3   2   potato  14.0
23: 3    1         1      10   1   potato  11.0
24: > tail(fryMelt, 3)
25:      time treatment subject rep variable value
26: 3478   10         3      78   2   painty   1.4
27: 3479   10         3      86   1   painty  10.5
28: 3480   10         3      86   2   painty   9.4


Lines 1 to 12 of the listing show the basic structure of our data. We can deduce from the data that at each time point, a subject was given two French fries to taste that had undergone one of three treatments. The subject rated each of the fries using the criteria defined in the remaining columns. These remaining columns are therefore our measurement variables, and time, treatment, subject, and rep are our ID variables. Once we have identified the ID and measurement variables, the code is fairly straightforward: we call the melt function and specify the ID variables using id.vars. As can be seen from line 18 onward in the listing, the resulting data is very long and thin. The column names of the measurement variables have been stacked into a single column named variable, with the ID variables repeated accordingly, and the associated values have been stacked into a column named value.

Casting

Calling the melt function on a data frame will normally produce a data frame in the desired format. However, more often than not some further work is required in order to “cast” the data into a new structure. The cast function in reshape (or dcast in reshape2) accepts a formula that describes the shape of the output format. It has the following basic form:

untouched_column_1 + untouched_column_2 ~ column_to_split_1 + column_to_split_2

On the left side we specify the columns that are to remain as they are. On the right side we specify the columns to be split apart into new columns; a new column is created for each unique combination of values of the variables on the right side of the formula. We never reference the value column, because it holds the measured data itself. The behavior is best seen through an example. In Listing 11.4 we create two new columns from the fryMelt data of Listing 11.3, based on the rep variable. The “...” notation means “all other columns.” A single period can also be used to represent “no variable” in the casting formula.

LISTING 11.4 Casting the french_fries Data


 1: > # Create two new columns based on the rep variable
 2: > fryReCast <- cast(fryMelt, ... ~ rep)
 3: > head(fryReCast, 3)
 4:   time treatment subject variable   1  2
 5: 1    1         1       3   potato 2.9 14
 6: 2    1         1       3  buttery 0.0  0
 7: 3    1         1       3   grassy 0.0  0



Note: Differences Between reshape and reshape2

In reshape2 the distinction is made between casting to data frames and casting to arrays. Instead of the cast function, we have two new functions: acast for arrays and dcast for data frames.


Using melt and then cast (or dcast) helps break up the reshaping process. For more complicated examples, it can be really useful to check that the intermediate “melted” data frame is as expected before casting into a new shape. However, this is not actually a necessary step. The entire transformation can be performed in a single step using the recast function. The only difference when using recast is that instead of the id.vars and measure.vars arguments that we used in melt, we drop the “s” and use id.var and measure.var instead.

> recast(french_fries,
+     id.var = c("time", "treatment", "subject", "rep"),
+     formula = ... ~ rep)
  time treatment subject variable    1    2
1    1         1       3   potato  2.9 14.0
2    1         1       3  buttery  0.0  0.0
3    1         1       3   grassy  0.0  0.0
...


Note: Aggregation Using reshape

The fun.aggregate argument to cast (and dcast in reshape2) provides the ability to aggregate the data using summary functions such as mean.

> # Mean across replicates
> replicateMeans <-
+   cast(fryMelt, time + treatment + subject + variable ~ ., mean)
> head(replicateMeans, 3)
  time treatment subject variable (all)
1    1         1       3   potato  8.45
2    1         1       3  buttery  0.00
3    1         1       3   grassy  0.00

Although it is possible to aggregate data using reshape, we will look at more straightforward aggregation techniques later in the hour and then again in Hour 12.


Restructuring with tidyr

The main difference between the reshape approach to restructuring and tidyr is the terminology. The functions melt and cast (or dcast) become gather and spread. Otherwise, the idea is very much the same. In tidyr we also have a third option, separate, that comes in handy when multiple pieces of information are stored together in a single variable.

Gather

When the values of a particular variable are spread over several columns, we look to “gather” the data into a single column. We do this using gather. The required arguments to the gather function are shown in Table 11.2.


TABLE 11.2 Arguments to the gather Function

Let’s look at how we would use gather with some real data. For this example, we will use the djiData stock data contained within the mangoTraining package. To simplify the example, we will first subset the data to obtain a data frame with three columns: the date, and the low and high values of the DJI for each date:

> djiHighLow <- djiData[, c("Date", "DJI.High", "DJI.Low")]
> head(djiHighLow, 3)
        Date DJI.High  DJI.Low
1 12/31/2014 18043.22 17820.88
2 12/30/2014 18035.02 17959.70
3 12/29/2014 18073.04 18021.57

Suppose that we want to create a single graphic of the high and low DJI values using one of the packages described in Hours 13–15. We need one column containing the values to plot and another column specifying whether each value was a high or a low value. We do this using the gather function.

Having loaded the package, we next specify each of the columns we wish to gather, separated by commas, referencing each by name directly and without wrapping it in quotes. As highlighted in Table 11.2, we must also specify names for the key and value columns in the gathered data frame. In this example we gather two columns, DJI.High and DJI.Low, but in general we can specify as many columns as we like:

> gatheredDJI <- gather(djiHighLow, key="DJI", value="Value", DJI.High, DJI.Low)
> head(gatheredDJI, 4)
        Date      DJI    Value
1 2014-12-31 DJI.High 18043.22
2 2014-12-30 DJI.High 18035.02
3 2014-12-29 DJI.High 18073.04
4 2014-12-26 DJI.High 18103.45

Variables that are not listed, such as Date in the preceding example, are unaffected by the gathering process. If we find the need to gather the majority of columns within our data, then instead of specifying what to gather we can specify what not to gather. We do so by listing columns that we are not interested in and placing a minus sign in front of each one.
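With the running example, this exclusion approach produces the same result as naming both measurement columns (this sketch assumes djiHighLow and tidyr from the example above):

```r
# Equivalent to the earlier call: gather every column except Date
gatheredDJI <- gather(djiHighLow, key = "DJI", value = "Value", -Date)
```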


Tip: Lots to Gather?

The tidyr package allows a special use of the : operator for sequencing. The operator allows us to specify a “from” and a “to” in terms of column names. Therefore, a:z would be interpreted as start gathering at column “a” and gather all columns up to column “z.”


Spread

The term “spread” is similar to “cast” in reshape. It enables us to take a column of values and a column label for these values (the “key”) and “spread” the contents over several columns. The primary arguments to spread are again key and value. A new column is created for each label in the key column. This can be useful if we need to calculate, say, changes over time. We take a column of values, value, and a column of times, key, at which these values occurred. We then spread the information, creating a new column for each time point. In the following example, we undo the process of gathering the low and high DJI values into a single column, spreading back into the two original columns:

> backToOriginal <- spread(gatheredDJI, key = DJI, value = Value)
> head(backToOriginal, 3)
        Date DJI.High  DJI.Low
1 01/02/2014 16573.07 16416.49
2 01/03/2014 16518.74 16439.30
3 01/06/2014 16532.99 16405.52


Tip: Piping Commands

The tidyr package has been designed to work with magrittr’s pipe operator. This allows us to chain commands together, thus avoiding intermediate data frames. You will learn more about the pipe operator in Hour 12.


Separate

Occasionally we may find ourselves in a situation where two separate pieces of information are joined together in a single variable. R packages provide a nice example of this. An R package source name is made up of a package name and version number. An example of this is shown here:

> Packages <- data.frame(Source=c("reshape_0.8.5", "tidyr_0.2.0"))
> Packages
         Source
1 reshape_0.8.5
2   tidyr_0.2.0

We can use the separate function to split the package names from the version numbers. Further arguments such as sep are used to specify the splitting character:

> separate(Packages, Source, into = c("Package", "Version"), sep = "_")
  Package Version
1 reshape   0.8.5
2   tidyr   0.2.0

By default, the original variable is deleted. We can override this behavior by specifying remove = FALSE.
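For instance, continuing the example above:

```r
# remove = FALSE keeps the original Source column alongside the new ones
separate(Packages, Source, into = c("Package", "Version"), sep = "_", remove = FALSE)
```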

Data Aggregation

In Hour 9, “Loops and Summaries,” you saw two ways of applying simple functions to more complex data structures:

- Iterate over sections of data with a loop.

- Use one of the “apply” family of functions.

Suppose we want to add a new column to airquality containing the difference between the Wind speed for a particular day and the median Wind speed for that Month. To achieve this, we need to perform three tasks:

- Calculate the median Wind speed by Month.

- Align each calculated median with the daily Wind speed data.

- Calculate the difference between the daily Wind speed and the aligned median.

Using a “for” Loop

If we choose to use loops, we could do the following, for example:

- Create an empty column in our data.

- For each row in the data:

  - Look at the Month value for this row.

  - Calculate the median Wind for all data with that Month value.

  - Calculate the difference between the daily Wind value and this median.

  - Insert this value in the cell.

This approach is very inefficient. For example, it involves calculating a median repeatedly (once per row). Instead, we could calculate the medians using one loop and then reference the values in a second loop, using an approach like this:

- Create an empty column in our data.

- For each unique Month value, calculate and store the median Wind.

- For each row in the data:

  - Look at the Month value for this row.

  - Reference the correct median Wind for that Month value (from the first loop).

  - Calculate the difference between the daily Wind value and this median.

  - Insert this value in the cell.
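The two-pass approach above can be sketched directly with the airquality data (working on a copy so the original data frame is untouched):

```r
aq <- airquality                 # work on a copy of the data
aq$DiffWind <- NA_real_          # create an empty column

# First loop: calculate and store the median Wind for each unique Month
medians <- numeric(0)
for (m in unique(aq$Month)) {
  medians[as.character(m)] <- median(aq$Wind[aq$Month == m])
}

# Second loop: look up the stored median for each row and take the difference
for (i in seq_len(nrow(aq))) {
  aq$DiffWind[i] <- aq$Wind[i] - medians[as.character(aq$Month[i])]
}
```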

Again, this isn’t ideal. Let’s instead consider (and see) an approach using the “apply” functions that we saw in Hour 9.

Using an “apply” Function

The first thing we have to decide is which “apply” function to use. Let’s first use the tapply function (or split and sapply) to return the median Wind by Month:

> head(airquality)   # Print airquality
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

> windMedians <- tapply(airquality$Wind, airquality$Month, median)
> windMedians
   5    6    7    8    9
11.5  9.7  8.6  8.6 10.3

This is straightforward and calculates the median Wind speed by Month, storing the results in a named vector. The next step is to align the daily values with the corresponding windMedians values so we can calculate the differences. This is, perhaps, the most complex part of this process.

As you saw in Hour 3, “Single-Mode Data Structures,” we can reference values from a vector using square brackets and specifying with blank, positive, negative, logical, or character inputs. In this case, we have a vector of Month values to use to reference values from the windMedians vector. Let’s convert our Month values to characters and then use those values to reference the (named) elements of windMedians:

> charMonths <- as.character(airquality$Month)   # Character values of Month
> # Use character values to reference named elements
> head(windMedians[charMonths])
   5    5    5    5    5    5
11.5 11.5 11.5 11.5 11.5 11.5

Now we can create a column of medians in our dataset and calculate differences from those. Of course, we don’t have to create the column of intermediate values, but we included it here to help illustrate the process:

> airquality$MedianWind <- windMedians[charMonths]                # Add median Wind column
> airquality$DiffWind <- airquality$Wind - airquality$MedianWind  # Calculate differences
> head(airquality, 3)                                             # First few rows
  Ozone Solar.R Wind Temp Month Day MedianWind DiffWind
1    41     190  7.4   67     5   1       11.5     -4.1
2    36     118  8.0   72     5   2       11.5     -3.5
3    12     149 12.6   74     5   3       11.5      1.1
> tail(airquality, 3)                                             # Last few rows
    Ozone Solar.R Wind Temp Month Day MedianWind DiffWind
151    14     191 14.3   75     9  28       10.3      4.0
152    18     131  8.0   76     9  29       10.3     -2.3
153    20     223 11.5   68     9  30       10.3      1.2

This approach works, but the second step (aligning the medians with the daily values) was perhaps a little complex. If we decide later that we want to perform the same process for a number of columns, the solution would become more verbose/complex. We can simplify this approach using the aggregate function.

The aggregate Function

The aggregate function allows us to apply functions over sections of a data frame, returning a data frame as the output. We can use aggregate in two different ways:

Image We can supply a “formula” to describe the data over which to apply.

Image We can specify a set of variables to summarize and a set of variables by which to summarize separately.

Let’s first see an example using a formula to define the structure of the data.

Using aggregate with a Formula

We can use a formula with aggregate to specify the variables to summarize and the variables by which to perform the summary. A basic formula is of the form Y ~ X, where Y is the variable to summarize and X is the variable by which to summarize. The aggregate function additionally accepts a data argument (specifying the data frame containing the data) and a FUN argument (specifying the function to apply). Let’s look at a simple example where we again calculate the median Wind by Month:

> aggregate(Wind ~ Month, data = airquality, FUN = median)
  Month Wind
1     5 11.5
2     6  9.7
3     7  8.6
4     8  8.6
5     9 10.3

As you can see, the return structure is a data frame, which is a very simple and usable structure.

Summarizing by Multiple Variables

If we want to apply the function by more than one variable, we can add the names of the variables to the set of variables in the formula:

> aggregate(Wind ~ Month + cut(Temp, 2), data = airquality, FUN = median)
   Month cut(Temp, 2) Wind
1      5    (56,76.5] 11.5
2      6    (56,76.5]  9.7
3      7    (56,76.5] 10.6
4      8    (56,76.5] 12.6
5      9    (56,76.5] 10.9
6      5    (76.5,97] 10.3
7      6    (76.5,97]  9.7
8      7    (76.5,97]  8.6
9      8    (76.5,97]  8.3
10     9    (76.5,97]  7.7

Again, the return structure is a data frame.

Summarizing Multiple Columns

If we want to perform the same summary on a number of variables at the same time, we can combine the summary variables in a call to cbind. For example, let’s calculate the median Wind and Ozone values by Month:

> aggregate(cbind(Wind, Ozone) ~ Month, data = airquality, FUN = median, na.rm = TRUE)
  Month Wind Ozone
1     5 11.5    18
2     6 11.5    23
3     7  7.7    60
4     8  8.0    52
5     9 10.3    23

Multiple Return Values

In the preceding examples, we used the median function, which returns a single value. If, instead, we used a function that returned multiple values, these would be returned as separate columns. To illustrate this behavior, let’s repeat the last three examples with the range function:

> # Range of Wind values by Month
> aggregate(Wind ~ Month, data = airquality, FUN = range, na.rm = TRUE)
  Month Wind.1 Wind.2
1     5    5.7   20.1
2     6    1.7   20.7
3     7    4.1   14.9
4     8    2.3   15.5
5     9    2.8   16.6

> # Range of Wind AND Ozone values by Month
> aggregate(cbind(Wind, Ozone) ~ Month, data = airquality, FUN = range, na.rm = TRUE)
  Month Wind.1 Wind.2 Ozone.1 Ozone.2
1     5    5.7   20.1       1     115
2     6    8.0   20.7      12      71
3     7    4.1   14.9       7     135
4     8    2.3   15.5       9     168
5     9    2.8   16.6       7      96

> # Range of Wind AND Ozone values by Month AND grouped Temp
> aggregate(cbind(Wind, Ozone) ~ Month + cut(Temp, 2), data = airquality,
+           FUN = range, na.rm = TRUE)
   Month cut(Temp, 2) Wind.1 Wind.2 Ozone.1 Ozone.2
1      5    (56,76.5]    6.9   20.1       1      41
2      6    (56,76.5]    9.2   20.7      12      37
3      7    (56,76.5]    6.9   14.3      10      16
4      8    (56,76.5]    7.4   14.3       9      23
5      9    (56,76.5]    6.9   16.6       7      30
6      5    (76.5,97]    5.7   14.9      45     115
7      6    (76.5,97]    8.0   14.9      21      71
8      7    (76.5,97]    4.1   14.9       7     135
9      8    (76.5,97]    2.3   15.5       9     168
10     9    (76.5,97]    2.8   15.5      16      96

In these examples, the values returned are named based on the column that was summarized and an index of the return value. If, instead, the function returned “named” elements, these names would be appended to the summarized column names:

> aggregate(Wind ~ Month, data = airquality,
+   FUN = function(X) {
+     c(MIN = min(X), MAX = max(X))
+   })
  Month Wind.MIN Wind.MAX
1     5      5.7     20.1
2     6      1.7     20.7
3     7      4.1     14.9
4     8      2.3     15.5
5     9      2.8     16.6

Using aggregate by Specifying Columns

Instead of a formula, we can use aggregate by specifying the variables separately in the function call. Specifically, we supply lists of variables, renaming the list elements if we want to control the names of the resulting summary variables:

Image The first input specifies the variable(s) to summarize.

Image The second input specifies the grouping variable(s).

Image The third input is the function to apply.

Let’s again calculate the median Wind by Month, this time specifying the inputs as described earlier:

> aggregate(list(aveWind = airquality$Wind), list(Month = airquality$Month), median)
  Month aveWind
1     5    11.5
2     6     9.7
3     7     8.6
4     8     8.6
5     9    10.3

The output is a data frame, with the variables named as specified in the input lists.

Summarizing by Multiple Variables

If we want to apply the function by more than one variable, we can add these variables to the list, as follows:

> aggregate(list(aveWind = airquality$Wind),
+   list(Month = airquality$Month, TempGroup = cut(airquality$Temp, 2)), median)
   Month TempGroup aveWind
1      5 (56,76.5]    11.5
2      6 (56,76.5]     9.7
3      7 (56,76.5]    10.6
4      8 (56,76.5]    12.6
5      9 (56,76.5]    10.9
6      5 (76.5,97]    10.3
7      6 (76.5,97]     9.7
8      7 (76.5,97]     8.6
9      8 (76.5,97]     8.3
10     9 (76.5,97]     7.7

Again, this approach allows us to easily control the names of the resulting variables (for example, naming the TempGroup and aveWind columns).

Summarizing Multiple Columns

If we want to perform the same summary on a number of variables at the same time, we can provide multiple variables in the first input list, as follows:

> aggregate(list(aveWind = airquality$Wind, aveOzone = airquality$Ozone),
+           list(Month = airquality$Month), median, na.rm = TRUE)
  Month aveWind aveOzone
1     5    11.5       18
2     6     9.7       23
3     7     8.6       60
4     8     8.6       52
5     9    10.3       23


Tip: Specifying Inputs as Data Frames

Because a data frame is, structurally, a list of vectors, we can supply data frame inputs directly instead of lists, if preferred. This is most useful when there are multiple variables being specified. For example, we could rewrite the last example as follows:

> aggregate(airquality[,c("Wind", "Ozone")],
+           list(Month = airquality$Month), median, na.rm = TRUE)
  Month Wind Ozone
1     5 11.5    18
2     6  9.7    23
3     7  8.6    60
4     8  8.6    52
5     9 10.3    23

Although this is far more concise, we do lose the ability to directly rename the variables (for example, to aveWind and aveOzone as per the previous example).
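If we want both the brevity of the data frame input and control over the names, one option (a sketch, not the only possibility) is to rename the columns up front with setNames before passing them to aggregate:

```r
data(airquality)

# Rename the summary columns before passing them to aggregate
res <- aggregate(setNames(airquality[, c("Wind", "Ozone")],
                          c("aveWind", "aveOzone")),
                 list(Month = airquality$Month), median, na.rm = TRUE)
res
```

This keeps the concise data frame input while still producing the aveWind and aveOzone column names from the earlier example.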


Multiple Return Values

As with the example where we specified formulas, we can apply functions that return multiple values. In this case, the index of values is appended to the summarized variable name:

> aggregate(list(Wind = airquality$Wind),
+   list(Month = airquality$Month), range)
  Month Wind.1 Wind.2
1     5    5.7   20.1
2     6    1.7   20.7
3     7    4.1   14.9
4     8    2.3   15.5
5     9    2.8   16.6

Again, if our function returns named elements, these are appended instead of the index values:

> aggregate(list(Wind = airquality$Wind),
+           list(Month = airquality$Month),
+           function(X) {
+             c(MIN = min(X), MAX = max(X))
+           })
  Month Wind.MIN Wind.MAX
1     5      5.7     20.1
2     6      1.7     20.7
3     7      4.1     14.9
4     8      2.3     15.5
5     9      2.8     16.6

Calculating Differences from Baseline

At the start of the Data Aggregation section, we introduced a task that we were aiming to complete and discussed how the previous approaches (for loops and apply functions) could be used to achieve that task. To recap, we are aiming to add a new column to airquality, containing the difference between the Wind speed for a particular day and the median Wind speed for that Month.

To achieve this, we need to perform three tasks:

Image Calculate the median Wind speed by Month.

Image Align the median Wind speed value calculated with the daily Wind speed data.

Image Calculate the difference between the daily Wind speed and the “median” data.

Using the aggregate function, we can calculate the median Wind by Month, returning our results as a data frame:

> windMedians <- aggregate(list(MedianWind = airquality$Wind),
+                          list(Month = airquality$Month), median)
> windMedians
  Month MedianWind
1     5       11.5
2     6        9.7
3     7        8.6
4     8        8.6
5     9       10.3


Note: Using List Inputs to Aggregate

In this example, I’m specifying the inputs to aggregate as list elements, instead of a formula, so I can explicitly control the naming of the summary (that is, the MedianWind column). If I used a formula, I’d need to rename the column to MedianWind as a second step.
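For comparison, the formula version with that extra renaming step might look like this:

```r
data(airquality)

# Formula interface, followed by an explicit rename of the summary column
windMedians <- aggregate(Wind ~ Month, data = airquality, FUN = median)
names(windMedians)[2] <- "MedianWind"   # Rename as a second step
windMedians
```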


Now that we have our median Wind values in a data frame, we can merge this onto our original dataset to create the MedianWind column:

> airquality <- merge(airquality, windMedians)
> head(airquality)
  Month Ozone Solar.R Wind Temp Day MedianWind
1     5    41     190  7.4   67   1       11.5
2     5    36     118  8.0   72   2       11.5
3     5    12     149 12.6   74   3       11.5
4     5    18     313 11.5   62   4       11.5
5     5    NA      NA 14.3   56   5       11.5
6     5    28      NA 14.9   66   6       11.5
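The third and final task, calculating the differences, is then a simple vectorized subtraction. Putting the whole process together from a fresh copy of the data:

```r
data(airquality)

# Median Wind by Month, merged back onto the daily data
windMedians <- aggregate(list(MedianWind = airquality$Wind),
                         list(Month = airquality$Month), median)
airquality <- merge(airquality, windMedians)

# Difference between each daily Wind value and its monthly median
airquality$DiffWind <- airquality$Wind - airquality$MedianWind
head(airquality[, c("Month", "Day", "Wind", "MedianWind", "DiffWind")], 3)
```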

Summary

In this hour, you saw how to sort, set, and merge data using traditional R functions. We looked at the popular reshape (reshape2) and tidyr packages for restructuring our data, ready for plotting and modeling. We also looked at various options for aggregating data including the powerful aggregate function.

In the next hour, we will look closer at two packages that are changing the way people manipulate and summarize data with R. The data.table and dplyr packages offer speed and efficiency, borrowing approaches from the database world.

Q&A

Q. I tried to sort the airquality data using airquality[sort(airquality$Wind),] but got strange results. What happened?

A. To sort a data frame in this way, you need to know which rows to select. The sort order is returned by the order function, not sort.
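A minimal illustration of the difference:

```r
data(airquality)

sort(airquality$Wind)[1:3]    # The sorted values themselves
order(airquality$Wind)[1:3]   # Row numbers that would sort the data

# Sort the data frame by Wind using order
sortedAir <- airquality[order(airquality$Wind), ]
head(sortedAir$Wind, 3)
```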

Q. I have two data frames, each containing data for specified locations at specified times. Can I merge by both variables?

A. Absolutely. You can merge by as many variables as you like. Pass the names of the merge-by variables to the by argument of merge as a character vector.
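A small sketch with made-up location/time data (the data frames here are hypothetical):

```r
# Two hypothetical data frames sharing Location and Time columns
readings <- data.frame(Location = c("A", "A", "B"),
                       Time     = c(1, 2, 1),
                       Temp     = c(20.1, 21.5, 18.3))
sites    <- data.frame(Location = c("A", "B"),
                       Time     = c(1, 1),
                       Status   = c("OK", "Check"))

# Merge by both variables by passing a character vector to 'by'
merged <- merge(readings, sites, by = c("Location", "Time"))
merged
```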

Q. Is it possible to merge three data frames at once using merge?

A. Unfortunately, no. However, the merge_recurse function in reshape provides this functionality.

Q. Should I be using reshape2 instead of reshape?

A. Development of reshape ceased in 2011. However, it depends on what you want to do. In some sense, reshape2 supersedes reshape; however, there is arguably more functionality contained within reshape. If you want to use reshape/reshape2 for data aggregation, it is worth noting that the cast function can handle summary functions such as range that produce a vector of multiple values, whereas dcast cannot and fails with an error.

Workshop

The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.

Quiz

1. What is the difference between sort and order?

2. Which function can be used to return the unique values in a vector?

3. What function would you use to append rows to a data frame?

4. What does the “d” represent in the dcast function?

Answers

1. The sort function is used to sort vectors. It cannot be used to sort data frames. The order function provides a sort order that can be used to sort vectors or data frames.

2. The unique function directly returns the unique values. Alternatively, duplicated could be used as a means to subscript and obtain the same result.
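Both approaches in a couple of lines:

```r
x <- c(3, 1, 3, 2, 1)

unique(x)            # Directly returns the unique values
x[!duplicated(x)]    # Same result via logical subscripting
```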

3. The rbind function is a simple means of appending new rows to a data frame.

4. The “d” stands for “data frame.” In reshape2, the more generic cast was replaced with acast and dcast functions to allow casting to both arrays and data frames via separate functions.

Activities

1. Sort the mtcars data frame by the number of cylinders and then descending by miles per gallon.

2. Extract the “Employees” and “Orders” tables from the Northwind.mdb file contained within the mangoTraining package using RODBC. Merge the two data frames by EmployeeID.

3. Use melt and dcast to find the average tip size by the sex and smoking habit of the bill payer using the tips data contained within the reshape2 package.

4. Separate the Date column within djiData into three new columns: Month, Day, and Year. Ensure that you keep the original Date column.
