The tidyr package is another gift from Hadley Wickham. This package provides functions to make your data tidy.
This means that after applying the tidyr package's functions, your data will be arranged as per the following rules:
These rules will produce a dataset similar to the following one:
This structure, besides giving you a clearer understanding of your data, will let you work with it more easily.
Furthermore, this structure will let you take full advantage of R's vectorized operations. This recipe will show you how to apply the gather() function to a dataset in order to transform it and make it comply with the cited rules.
The employed data frame is in the so-called wide format, where each period of observation is stored in columns, with each column representing a year, as follows:
In order to apply the tidyr functions, you will have to install and load the tidyr package within your R environment by running the following code:
install.packages("tidyr")
library(tidyr)
Moreover, we will need to install and load the rio package, which is covered in greater detail in the Loading your data into R with rio packages recipe in Chapter 1, Acquiring Data for Your Project:
install.packages("rio")
library(rio)
The dataset tidied in this recipe is the world_gdp_data.csv dataset.
This dataset is provided in the RStudio project for this book.
You can download it by authenticating your account at http://packtpub.com.
The world_gdp_data dataset stores GDP values for 248 countries around the globe, from 1960 to 2015.
In the There's more… section, you will find other examples of messy data for which tidyr comes in handy.
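1. Import the dataset into a data frame:
messy_gdp <- import("world_gdp_data.csv")
2. Apply the gather() function to your data frame:
tidy_gdp <- gather(messy_gdp, "year", "gdp", 5:61)
3. Visualize the result with the RStudio viewer:
View(tidy_gdp)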
In step 1, we create a data frame with our data. This step leverages the rio package, which is covered in the Loading your data into R with rio packages recipe in Chapter 1, Acquiring Data for Your Project. Refer to it for further explanations.
In step 2, we apply the gather() function to our data frame. The gather() function is one of the four main functions available within the tidyr package:
gather()
spread()
unite()
separate()
The gather() function collects values that are spread across several columns of the messy dataset and creates a new data frame where those values are stacked in a single column, each one labeled by a key attribute.
More formally, here we take a dataset stored in wide form and transform it into a long form dataset.
In our example, values are annual GDP and keys are years.
The gather() function requires you to specify the following arguments:
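As a minimal sketch of how these arguments fit together, the following applies gather() to a small made-up data frame. The toy_gdp object and its values are illustrative assumptions and are not taken from world_gdp_data.csv. The first argument is the data frame, the second and third are the names of the new key and value columns, and the last one selects the columns to gather:

```r
library(tidyr)

# Hypothetical wide data frame: one column per year (values are made up)
toy_gdp <- data.frame(
  country = c("italy", "russia"),
  `2011` = c(10, 30),
  `2012` = c(20, 40),
  check.names = FALSE,       # keep "2011" and "2012" as column names
  stringsAsFactors = FALSE
)

# data frame, key column name, value column name, columns to gather (by position)
toy_tidy <- gather(toy_gdp, "year", "gdp", 2:3)

# One row per country/year pair, with the former column names stored in "year"
print(toy_tidy)
```

In tidyr 1.0.0 and later, gather() is documented as superseded by pivot_longer(), although it remains available and behaves as described in this recipe.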
In step 3, we visualize the result with the RStudio viewer. One of the most powerful RStudio tools is the data viewer, which lets you get a spreadsheet-like view of your data frames. With version 0.99 of RStudio, this tool got even more powerful, removing the previous 1000-row limit and adding the ability to filter and order your data.
When using this viewer, you should be aware that filtering and ordering activities do not affect the original data frame object you are visualizing.
As seen earlier, in addition to gather(), the tidyr package supplies the following three functions for data preparation:
The spread() function is used when the names of several variables are stored in a single column, as is the case in the following dataset, named second_messy_world_gdp:
country, data, value
italy, year, 2012
italy, gdp, 20000
russia, year, 2012
russia, gdp, 1100000
Running spread(second_messy_world_gdp, data, value) will result in the following tidy dataset, tidy_world_gdp:
country, year, gdp
italy, 2012, 20000
russia, 2012, 1100000
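The spread() example can be reproduced with the short sketch below, which rebuilds the small dataset by hand. The data frame construction is an assumption made for illustration. Note that spread() may order the new columns by key value, so the sketch reorders them explicitly to match the layout shown in the text:

```r
library(tidyr)

# Hand-built copy of the messy dataset: variable names live in the "data" column
second_messy_world_gdp <- data.frame(
  country = c("italy", "italy", "russia", "russia"),
  data = c("year", "gdp", "year", "gdp"),
  value = c(2012, 20000, 2012, 1100000),
  stringsAsFactors = FALSE
)

# key = data (becomes the column names), value = value (fills the cells)
tidy_world_gdp <- spread(second_messy_world_gdp, data, value)

# Reorder the columns explicitly so they match the table in the text
tidy_world_gdp <- tidy_world_gdp[, c("country", "year", "gdp")]
print(tidy_world_gdp)
```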
The separate() function is used when two or more variables are stored in a single column, joined side by side, as in the following dataset, named third_messy_world_gdp:
country, year_gdp
italy, 2012_20000
russia, 2012_1100000
In this case, running separate(third_messy_world_gdp, year_gdp, c("year", "gdp"), sep = "_") will result in the following tidy data frame:
country, year, gdp
italy, 2012, 20000
russia, 2012, 1100000
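A minimal runnable sketch of the separate() case follows, again with the dataset rebuilt by hand as an assumption. By default the new columns are character vectors, so the sketch converts them to numbers afterwards:

```r
library(tidyr)

# Hand-built copy of the messy dataset: year and gdp share one column
third_messy_world_gdp <- data.frame(
  country = c("italy", "russia"),
  year_gdp = c("2012_20000", "2012_1100000"),
  stringsAsFactors = FALSE
)

# Split year_gdp at the underscore into two new columns
separated_gdp <- separate(third_messy_world_gdp, year_gdp,
                          c("year", "gdp"), sep = "_")

# The pieces come back as character strings, so convert them explicitly
separated_gdp$year <- as.integer(separated_gdp$year)
separated_gdp$gdp <- as.numeric(separated_gdp$gdp)
print(separated_gdp)
```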
The unite() function can be considered the opposite of separate() and can be used when a single variable is spread among different columns. Here is an example, named fourth_messy_world_gdp:
year, month, day, value
2012, 12, 31, 120003
2012, 05, 12, 4533203
In this case, we can join the first three columns, creating a date variable simply by running unite(fourth_messy_world_gdp, col = "record_date", c(year, month, day), sep = "_"), which will result in the following tidy data frame:
record_date, value
2012_12_31, 120003
2012_05_12, 4533203
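The unite() case can be sketched in the same way. The data frame below is rebuilt by hand as an assumption, with the date parts stored as character strings so that the leading zero in the month 05 is preserved. Here the columns to join are passed directly by name:

```r
library(tidyr)

# Hand-built copy of the messy dataset: the date is split over three columns
fourth_messy_world_gdp <- data.frame(
  year = c("2012", "2012"),
  month = c("12", "05"),     # character, to keep the leading zero
  day = c("31", "12"),
  value = c(120003, 4533203),
  stringsAsFactors = FALSE
)

# Paste year, month, and day together into record_date, joined by "_"
united_gdp <- unite(fourth_messy_world_gdp, col = "record_date",
                    year, month, day, sep = "_")
print(united_gdp)
```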
Please note that both the separate() and unite() functions take a sep parameter, indicating in the first case the character to look for in order to perform the column separation, and in the second case the character that will be used to join the column values.
More about tidy data and its use can be found in the paper Tidy data, written by Hadley Wickham, the main author of the package. The paper is freely available at http://www.jstatsoft.org/article/view/v059i10.