Introduction to dplyr

What is dplyr? Well, dplyr can be perceived as a grammar of data manipulation. It has been created for the R community by Hadley Wickham, Romain Francois, and RStudio.

What does dplyr give the Tableau user? We will use dplyr in order to cleanse, summarize, group, chain, filter, and visualize our data in Tableau.

Summarizing the data with dplyr

Firstly, let's import the packages that we need. These packages are listed in the following table, followed by the code itself.

Packages required for the hands-on exercise:

Package Name

Description

Reference

WDI

Search, extract, and format data from the World Bank's World Development Indicators

https://cran.r-project.org/web/packages/WDI/index.html

dplyr

dplyr is a grammar of data manipulation

 

As we walk through the script, the first thing we need to do is install the packages.

Once you have installed the packages, we need to call each library.

Once we have called the libraries, then we need to obtain the data from the World Data Bank website. When we have downloaded the data, we can assign it to a variable. In this example, we will call the variable dat. Before we do any further analysis, we will summarize the data using the summary command.

To start working with the data, let's use the summary command. In this example, the summary command is given here:

summary(dat)

In the next step, let's use the head command in order to see some of the data. Here is the command:

head(dat)

With the dplyr package, we can filter data so that we only see the data that we would like to see. For example, for the World Development Indicators (WDI) data, we may wish to see data that is labeled for the whole world. This means that it would match the word WORLD. In the next command, we can see the filtered data. The filter command looks for data that matches the condition in the code. In this example, the data is scanned for rows where the region is shown to be set to world:

dat[dat$Major.area..region..country.or.area.. == "WORLD"]

When we use the slice command, we can see that the data is shown for the rows that you specify in the command line.

This means that we can slice this data so that it only shows data that appears within the number of rows by position. In this example, we restrict the slice command so that it shows only the first three lines of code. The slice command is nice and neat. In contrast, if we were to rewrite the slice command into the corresponding command in R, the command would be less readable and it would be much longer.

We can also reorder data by using the arrange command. This piece of code will order the data along the lines of the columns that is stated in the function's arguments. So, for example, the following code will reorder the data using the Country.code column:

arrange(dat, Country.code)

We can also choose columns by using the select command. This command specifies which columns we would like to return. It allows us to rename the column names using variables, using the named arguments specified in the piece of code. Here is an example, where some of the column names are renamed.

However, the select command drops all the variables that are not explicitly stated in the code. Unfortunately, this renders the select command to be not that useful. Instead, we can use the rename command to achieve the same thing as the select command, but the variables are all retained, whether they are explicitly stated or not. Here is an example of the rename command in use:

rename(dat,Index =Index,Variant=Variant,region=Major.area..region..country.or.area.., Notes=Notes, Countrycode=Country.code )

The select command is very useful when it is combined with the distinct command. The distinct command is designed to return the unique values in a data frame. The select command is used to retrieve data for the specified columns, and it is combined with the distinct command to return the unique values for the selected columns in a data frame. Here is an example where we use the select and distinct commands together, and the results are assigned to a variable called sampledat:

sampledat <- distinct(select(dat,Index,Variant,region=Major.area..region..country.or.area.., Notes, Countrycode=Country.code, X2015=X2015, X2016=X2016, X2017=X2017 ) )

Let's take a look at the output of the sampledat variable using the head command:

head(sampledat)
Summarizing the data with dplyr

Output of sample data

Now, let's repeat the command for the first 20 years in the dataset, from 2015 until 2035. To do this, we will execute a piece of code, which does the following things:

It selects the relevant columns, and renames some of the columns so that they are more readable. The year column names are renamed to remove the letter X prefix, and the year names are turned into characters by virtue of being enclosed in quotation marks. Then, the code assigns the result to a variable called distinctdat:

distinctdat <- distinct(select(dat,Index=Index,Variant=Variant,Region=Major.area..region..country.or.area.., Notes, Countrycode=Country.code,
"2015"=X2015, "2016"=X2016,"2017"=X2017,"2018"=X2018,"2019"=X2019,"2020"=X2020,
"2021"=X2021,"2022"=X2022,"2023"=X2023,"2024"=X2024,"2025"=X2025,"2026"=X2026,"2027"=X2027,"2028"=X2028,"2029"=X2029,
"2030"=X2030,"2031"=X2031,"2032"=X2032,"2033"=X2033,"2034"=X2034,"2035"=X2035))

Let's take a look at the resulting data, which is stored in the distinctdat variable, using the previous head command:

head(distinctdat)

This is an excerpt of the result set, contained in the distinctdat variable:

Summarizing the data with dplyr

Result set showing erroneous data format

In the column headed 2015, it's possible to see that the values contain spaces. If we were to write this data to a CSV file, only the numbers after the space would be written to the file. This means that we will need to do something about the spaces before that point.

We could use a substitution command, such as gsub, to remove the spaces for every year, from 2015 right through to 2035. However, this would mean repeating the command for each year. We could also write a function for this purpose, and call it for every year.

Although this method would work, the resulting data would not appear nicely in Tableau. The reason for this is that each year is still treated separately, even though the actual metric is the same. Ideally, it's better to unpivot the data.

Unpivoting columns creates an attribute column for each selected column heading and a value column for each column cell value. The attribute-value pair columns are inserted after the last column. In R, we can unpivot data using the melt command.

In our example, we would like to unpivot the data held in the distinctdat variable along the attribute columns, which are Index, Variant, Region, Notes, and Countrycode. The other rows, which hold the data from 2015 to the year 2035, would all be placed into two columns. One column will hold the year, and the other column will hold the value of the projected world population. This unpivot result is achieved with the following code:

melteddata <- melt(distinctdat, id=c("Index","Variant","Region","Notes","Countrycode"))

In the preceding example, each date column becomes a row with an attribute column containing the date value and a value column containing the date column value. So, if we run the preceding command, then data is stored in the melteddata variable.

If we slice the melteddata variable, then we can see the results more clearly:

Summarizing the data with dplyr

From the preceding example code output, we can see that the year data now appears in the variable column. The population count data is now held in the value column.

Now it's possible to work with the value column in order to remove the spaces. Before we do that, let's tidy up the column names so that further code will be more readable.

Since we looked at rename earlier, let's look at a different way of achieving the same thing, whilst showing the different functionality of dplyr. We can add a new year column, to hold the year data. Also, we can add a new column called populationcount, which is a duplicate of the value column. We can do this simply by running the following commands to create new columns:

melteddata$PopulationCount <- melteddata$value
melteddata$Year <- melteddata$variable

Then, we can use the select command in dplyr to select the rest of the columns, removing the value column:

melteddata <- select(melteddata, select=-value, -variable)

This piece of code means that all of the columns held in the melteddata variable are selected except the value and variable columns, which are denoted with the minus in front of them. The remaining columns are assigned to the melteddata variable. We can use the slice command again to see what's contained in the melteddata variable, and you can see an example output as follows:

Summarizing the data with dplyr

Renamed output

Let's move forward to work with the PopulationCount data, so that we can remove the spaces. In order to do that, we will use the gsub pattern matching and replacement mechanism. The gsub function replaces all occurrences of a particular pattern in a string.

In order to use gsub, we have to specify that the data is a factor. The following code takes care of the conversion:

melteddata$PopulationCount <- as.factor(melteddata$PopulationCount)

The piece of code as.factor does the conversion part. It is more efficient than using strings. Factors can be used in statistical modeling, where they will be assigned the correct number of the degrees of freedom. Factors are also very useful when we are working with graphics in Tableau, because they will be easy for the business users to understand and use in their own analyses. As before, the new data is assigned to the old variable name.

In the following piece of code, we will use gsub for the substitution process. It will replace every instance of a space with nothing, thereby removing the spaces. Then, the changes will be assigned back to the PopulationCount column. The code is given as follows:

melteddata$PopulationCount <- gsub(" ","",melteddata$PopulationCount)

Once the spaces have been removed, let's change the data type back so that it is a numeric data type. Down the line, this means that the output is more likely to be recognized as a number by Tableau. For users, it will be easier for them to work with the data if it is presented to them conveniently, so that they can start to get insights at the earliest. Although Tableau makes it easy for business users to change data types, users prefer not to have any impedance in their way when working with data. The command is as follows:

melteddata$PopulationCount <- as.numeric(melteddata$PopulationCount)

We can slice the data again, in order to see how it looks now:

Summarizing the data with dplyr

Completed dataset

Once we have fixed our data, we can look at other activities on the data, such as grouping the data. We can use the summarise command in order to group the data. In this example, we are grouping the data so that we have the overall population mean. We have a variable called OverallPopulationmean, and it contains the overall count of rows, along with the mean of population. Here is the example code:

OverallPopulationmean <- summarise(melteddata, count=n(), 
                                   OverallPopulationmean = mean(melteddata$PopulationCount, na.rm=TRUE))

The result is given here:

  count OverallPopulationmean
1  5733              249386.9

However, this isn't an accurate picture, because we know that the data itself contains both summary and detail data. Let's write the data to a CSV file, and we can explore it further in Tableau:

write.csv(melteddata, "melteddata.csv")

When we import the cleansed data into Tableau, we can filter the data so that we can simply see the Region data. Here is an example worksheet:

Summarizing the data with dplyr

Sample Tableau Workbook with cleansed data

We have nice, clean line charts, which explain the message of the data. According to the projections, there is a steep projected rise in Africa, with a slight fall off in Europe.

What would happen if we hadn't cleansed the data? Well, let's give Tableau the dirty dataset, and let's see what happens.

We can import the CSV file straight from the World Data Bank website, and into Tableau. The first thing that we notice is that instead of having a nice Year dimension, all of the years appear as individual dimensions. Here is an example:

Summarizing the data with dplyr

Tableau import of the Years as individual dimensions

Unfortunately, this makes the data very difficult to navigate. The Tableau user would have to drag each year to the canvas. In the following example, let's take across the years from 2015 to 2030. Now let's see what happens:

Summarizing the data with dplyr

Tableau import of the dirty data

We can see here that the results are almost impossible for Tableau to visualize. The data is difficult for Tableau to interpret, and this result isn't going to be a good experience for the end users.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset