Preparing your data for analysis with the tidyr package

The tidyr package is another gift from Hadley Wickham. This package provides functions to make your data tidy.

This means that after applying the tidyr package's functions, your data will be arranged as per the following rules:

  • Each column will contain an attribute
  • Each row will contain an observation
  • Each cell will contain a value

These rules will produce a dataset similar to the following one:

[Figure: example of a tidy dataset]

This structure, besides giving you a clearer understanding of your data, will let you work with it more easily.

Furthermore, this structure will let you take full advantage of R's vectorized operations. This recipe will show you how to apply the gather() function to a dataset in order to make it comply with the cited rules.
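As a minimal sketch of what vectorization offers on tidy data, assume a small data frame like the one below (the name tidy_gdp_example and all of its figures are invented for illustration). Because each variable sits in a single column, one expression can act on every observation at once:

```r
# Hypothetical tidy data: one row per country-year observation (values invented)
tidy_gdp_example <- data.frame(
  country = c("italy", "italy", "russia", "russia"),
  year    = c(2011L, 2012L, 2011L, 2012L),
  gdp     = c(19000, 20000, 1000000, 1100000),
  stringsAsFactors = FALSE
)

# A single vectorized expression rescales the whole gdp column
tidy_gdp_example$gdp_thousands <- tidy_gdp_example$gdp / 1000

# Group-wise summaries also work directly on the columns
mean_gdp_by_country <- tapply(tidy_gdp_example$gdp, tidy_gdp_example$country, mean)
```

With the same data in wide format, the equivalent computations would have to be repeated for, or looped over, every year column.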

The data frame used here is in the so-called wide format, where each period of observation is stored in its own column, with each column representing a year, as follows:

[Figure: the GDP data frame in wide format, with one column per year]

Getting ready

To apply the tidyr functions, you will have to install and load the tidyr package within your R environment by running the following code:

install.packages("tidyr")
library(tidyr)

Moreover, we will need to install and load the rio package, which is covered in greater detail in the Loading your data into R with rio packages recipe in Chapter 1, Acquiring Data for Your Project:

install.packages("rio")
library(rio)

The dataset tidied in this recipe is the world_gdp_data.csv dataset.

This dataset is provided in the RStudio project for this book.

You can download it by authenticating your account at http://packtpub.com.

The world_gdp_data dataset stores GDP values for 248 countries around the globe, from 1960 to 2015.

In the There's more… section, you will find other examples of messy data for which tidyr comes in handy.

How to do it...

  1. Create a data frame with your data:
    messy_gdp <- import("world_gdp_data.csv")
  2. Apply the gather() function to your data frame:
    tidy_gdp <- gather(messy_gdp, "year", "gdp", 5:61)
  3. Visualize the result with the RStudio viewer:
    View(tidy_gdp)

How it works...

In step 1, we create a data frame with our data. This step leverages the rio package, which is covered in the Loading your data into R with rio packages recipe in Chapter 1, Acquiring Data for Your Project. Refer to it for further explanations.

In step 2, we apply the gather() function to our data frame. The gather() function is one of the four tidyr functions covered in this recipe:

  • gather()
  • spread()
  • unite()
  • separate()

This function collects the values scattered across several columns of the messy dataset and creates a new data frame where those values are stacked in a single column, each one labeled by a key attribute.

More formally, here we take a dataset stored in wide form and transform it into a long form dataset.

In our example, values are annual GDP and keys are years.

The gather() function requires you to specify the following arguments:

  • The data frame to reshape
  • The name to assign to the key column
  • The name to assign to the value column
  • The columns in which to find values

All other columns will be left unchanged and their values will be repeated as needed.
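The arguments listed above can be seen at work on a toy data frame. This is only a sketch: wide_gdp_example, long_gdp_example, and the GDP figures are hypothetical and stand in for the real world_gdp_data.csv file.

```r
library(tidyr)

# Hypothetical wide-format data: one column per year (values invented)
wide_gdp_example <- data.frame(
  country = c("italy", "russia"),
  `2011`  = c(19000, 1000000),
  `2012`  = c(20000, 1100000),
  check.names = FALSE,        # keep "2011" and "2012" as column names
  stringsAsFactors = FALSE
)

# Arguments in order: data, key column name, value column name,
# and the columns holding the values (here columns 2 and 3)
long_gdp_example <- gather(wide_gdp_example, "year", "gdp", 2:3)

# The country column is left untouched and repeated once per year
long_gdp_example
```

Note that the key column is created as character, so year may need converting with as.integer() before numeric use. In tidyr 1.0.0 and later, gather() is marked as superseded by pivot_longer(), though it remains available.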

In step 3, we visualize the result with the RStudio viewer. One of the most powerful RStudio tools is the data viewer, which lets you get a spreadsheet-like view of your data frames. With version 0.99 of RStudio, this tool got even more powerful, removing the previous 1000-row limit and adding the ability to filter and order your data.

When using this viewer, you should be aware that filtering and ordering do not affect the original data frame object you are visualizing.

There's more...

As seen earlier, in addition to gather(), the tidyr package supplies the following three functions for data preparation:

The spread() function is used when variables are stored in a column, as is the case in the following dataset:

  • second_messy_world_gdp:
    country, data, value
    italy, year, 2012
    italy, gdp, 20000
    russia, year, 2012
    russia, gdp, 1100000

Running spread(second_messy_world_gdp, data, value) will result in the following tidy dataset:

  • tidy_world_gdp:
    country, year, gdp
    italy, 2012, 20000
    russia, 2012, 1100000
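The spread() call can be tried on a data frame built by hand. The sketch below recreates the second_messy_world_gdp example inline rather than reading it from a file:

```r
library(tidyr)

# The messy dataset from the text: variable names are stored in the data column
second_messy_world_gdp <- data.frame(
  country = c("italy", "italy", "russia", "russia"),
  data    = c("year", "gdp", "year", "gdp"),
  value   = c(2012, 20000, 2012, 1100000),
  stringsAsFactors = FALSE
)

# data is the key column (future column names), value holds the cell values
tidy_world_gdp <- spread(second_messy_world_gdp, data, value)
tidy_world_gdp
```

The result has one row per country, with year and gdp as separate columns; their left-to-right order may differ from the listing above.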

The separate() function is used when two or more variables are stored in a single column joined side by side:

  • third_messy_world_gdp:
    country, year_gdp
    italy,2012_20000
    russia,2012_1100000

In this case, running separate(third_messy_world_gdp, year_gdp, c("year", "gdp"), sep = "_") will result in the following tidy data frame:

  • country, year, gdp
    italy,2012,20000
    russia,2012,1100000
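A minimal sketch of the separate() call follows, with the example data frame built inline. The convert = TRUE argument is an addition not used in the text; it asks separate() to turn the resulting character columns into numbers where possible.

```r
library(tidyr)

# The messy dataset from the text: year and gdp share a single column
third_messy_world_gdp <- data.frame(
  country  = c("italy", "russia"),
  year_gdp = c("2012_20000", "2012_1100000"),
  stringsAsFactors = FALSE
)

# Split year_gdp at the underscore into two new columns;
# convert = TRUE (an addition to the original call) makes them numeric
tidy_from_separate <- separate(
  third_messy_world_gdp, year_gdp, c("year", "gdp"),
  sep = "_", convert = TRUE
)
tidy_from_separate
```

Without convert = TRUE, both new columns would remain character vectors.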

The unite() function can be considered the opposite of separate() and can be used when a single variable is spread among different columns. Here is an example:

  • fourth_messy_world_gdp:
    year, month, day, value
    2012, 12, 31, 120003
    2012, 05, 12, 4533203

In this case, we can join the first three columns, creating a date variable simply by running unite(fourth_messy_world_gdp, col = "record_date", c(year, month, day), sep = "_"), which will result in the following tidy data frame:

record_date, value
2012_12_31,120003
2012_05_12,4533203
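The unite() call can be sketched as follows, with the example data built inline. The date parts are stored as character here (an assumption of this sketch) so that the leading zero of month 05 survives; the final as.Date() step is also an illustrative addition, not part of the recipe.

```r
library(tidyr)

# The messy dataset from the text; date parts kept as character
# so that "05" does not lose its leading zero
fourth_messy_world_gdp <- data.frame(
  year  = c("2012", "2012"),
  month = c("12", "05"),
  day   = c("31", "12"),
  value = c(120003, 4533203),
  stringsAsFactors = FALSE
)

# Join year, month, and day into one record_date column
united_gdp <- unite(
  fourth_messy_world_gdp, col = "record_date",
  c(year, month, day), sep = "_"
)

# Optional extra step: parse the joined string into a proper Date
record_dates <- as.Date(united_gdp$record_date, format = "%Y_%m_%d")
```

By default, unite() drops the input columns, which is why only record_date and value remain.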

Please note that both the separate() and unite() functions accept a sep parameter. In separate(), it indicates the character to look for in order to split the column; in unite(), it is the character used to join the column values.

More about tidy data and its use can be found in the paper Tidy data, written by Hadley Wickham, the main author of the package. The paper is freely available at http://www.jstatsoft.org/article/view/v059i10.
