What is dplyr? Well, dplyr can be described as a grammar of data manipulation. It was created for the R community by Hadley Wickham, Romain Francois, and RStudio.
What does dplyr give the Tableau user? We will use dplyr in order to cleanse, summarize, group, chain, filter, and visualize our data in Tableau.
Firstly, let's import the packages that we need. These packages are listed in the following table, followed by the code itself.
Packages required for the hands-on exercise:

Package Name | Description | Reference
---|---|---
WDI | Search, extract, and format data from the World Bank's World Development Indicators | CRAN
dplyr | A grammar of data manipulation | CRAN
reshape2 | Flexibly reshape data; provides the melt command used later in this exercise | CRAN
As we walk through the script, the first thing we need to do is install the packages. Once the packages are installed, we need to load each library.
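As a sketch, assuming the exercise uses the WDI, dplyr, and reshape2 packages (WDI appears in the table above; dplyr and reshape2 supply the data manipulation and melt commands used later):

```r
# Install once per machine (uncomment on first run); the package set is assumed
# install.packages(c("WDI", "dplyr", "reshape2"))

# Load each library for the current session
library(WDI)       # access to the World Development Indicators
library(dplyr)     # the data manipulation grammar used throughout
library(reshape2)  # supplies melt() for unpivoting later on
```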
Once we have called the libraries, we need to obtain the data from the World Data Bank website. When we have downloaded the data, we can read it into a variable. In this example, we will call the variable dat.
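The import step can be sketched as follows. Since the real download isn't available here, the snippet first writes a tiny two-row stand-in CSV; with the real file, only the read.csv call is needed, pointed at whatever filename you saved from the site:

```r
# Write a two-row stand-in CSV, then read it back exactly as you would
# read the real download (the columns and values here are placeholders)
csvpath <- tempfile(fileext = ".csv")
writeLines(c("Country.code,X2015",
             "900,7 349 472",
             "903,1 194 370"), csvpath)

dat <- read.csv(csvpath)   # with the real file: dat <- read.csv("your_file.csv")
nrow(dat)                  # the stand-in has two rows
```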
To start working with the data, let's run the summary command on it:
summary(dat)
In the next step, let's use the head
command in order to see some of the data. Here is the command:
head(dat)
With the dplyr package, we can filter data so that we only see the data that we would like to see. For example, in the World Development Indicators (WDI) data, we may wish to see only the data that is labeled for the whole world, which means matching the word WORLD. The filter command looks for rows that match the condition given in the code. In this example, the data is scanned for rows where the region column is set to WORLD:
filter(dat, Major.area..region..country.or.area.. == "WORLD")
When we use the slice command, we see the data only for the rows that we specify in the command. This means that we can slice the data so that it shows only the rows at the positions we name. In this example, we restrict the slice command so that it shows only the first three rows of data. The slice command is nice and neat. In contrast, if we were to rewrite the slice command as the corresponding base R command, it would be less readable and much longer.
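A sketch of the slice call just described, on a small toy stand-in for the downloaded data frame:

```r
library(dplyr)

# Toy stand-in for the downloaded data frame
dat <- data.frame(Country.code = c(900, 903, 904, 905),
                  X2015 = c(10, 20, 30, 40))

# Keep only the first three rows, selected by position
slice(dat, 1:3)
```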
We can also reorder data by using the arrange command. This piece of code will order the data by the columns stated in the function's arguments. So, for example, the following code will reorder the data using the Country.code column:
arrange(dat, Country.code)
We can also choose columns by using the select command. This command specifies which columns we would like to return. It also allows us to rename columns via the named arguments given in the call, so some of the columns can be renamed as they are selected.
However, the select command drops all the variables that are not explicitly stated in the code. If all we want to do is rename columns, this makes the select command less useful. Instead, we can use the rename command, which renames the stated columns while retaining all the variables, whether they are explicitly stated or not. Here is an example of the rename command in use:
rename(dat, region = Major.area..region..country.or.area.., Countrycode = Country.code)
The select command is very useful when it is combined with the distinct command. The distinct command is designed to return the unique rows in a data frame, so combining the two returns the unique values for the selected columns. Here is an example where we use the select and distinct commands together, and the results are assigned to a variable called sampledat:
sampledat <- distinct(select(dat, Index, Variant,
                             region = Major.area..region..country.or.area..,
                             Notes, Countrycode = Country.code,
                             X2015, X2016, X2017))
Let's take a look at the output of the sampledat
variable using the head command:
head(sampledat)
Now, let's repeat the command for the years from 2015 to 2035 in the dataset. To do this, we will execute a piece of code that does the following things:
It selects the relevant columns and renames some of them so that they are more readable. The year column names are renamed to remove the letter X prefix; because the new names start with a digit, they must be enclosed in quotation marks. Then, the code assigns the result to a variable called distinctdat:
distinctdat <- distinct(select(dat, Index, Variant,
                               Region = Major.area..region..country.or.area..,
                               Notes, Countrycode = Country.code,
                               "2015" = X2015, "2016" = X2016, "2017" = X2017,
                               "2018" = X2018, "2019" = X2019, "2020" = X2020,
                               "2021" = X2021, "2022" = X2022, "2023" = X2023,
                               "2024" = X2024, "2025" = X2025, "2026" = X2026,
                               "2027" = X2027, "2028" = X2028, "2029" = X2029,
                               "2030" = X2030, "2031" = X2031, "2032" = X2032,
                               "2033" = X2033, "2034" = X2034, "2035" = X2035))
Let's take a look at the resulting data, which is stored in the distinctdat
variable, using the previous head
command:
head(distinctdat)
This is an excerpt of the result set, contained in the distinctdat
variable:
In the column headed 2015, it's possible to see that the values contain spaces. If we were to write this data to a CSV file, only the numbers after the space would be written to the file. This means that we will need to do something about the spaces before that point.
We could use a substitution command, such as gsub
, to remove the spaces for every year, from 2015 right through to 2035. However, this would mean repeating the command for each year. We could also write a function for this purpose, and call it for every year.
Although this method would work, the resulting data would not appear nicely in Tableau. The reason for this is that each year is still treated separately, even though the actual metric is the same. Ideally, it's better to unpivot the data.
Unpivoting columns creates an attribute column holding the former column headings and a value column holding the corresponding cell values. The attribute-value pair columns are inserted after the last column. In R, we can unpivot data using the melt command from the reshape2 package.
In our example, we would like to unpivot the data held in the distinctdat
variable along the attribute columns, which are Index
, Variant
, Region
, Notes
, and Countrycode
. The other rows, which hold the data from 2015 to the year 2035, would all be placed into two columns. One column will hold the year, and the other column will hold the value of the projected world population. This unpivot result is achieved with the following code:
melteddata <- melt(distinctdat, id=c("Index","Variant","Region","Notes","Countrycode"))
In the preceding example, each year column becomes a set of rows, with a variable column containing the year name and a value column containing that year's cell values. So, if we run the preceding command, the unpivoted data is stored in the melteddata variable.
If we slice
the melteddata
variable, then we can see the results more clearly:
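To see the shape of the melted data, here is a sketch on a toy two-row version of distinctdat, with the column names assumed from the earlier code:

```r
library(dplyr)
library(reshape2)

# Toy two-row stand-in for distinctdat, with a single year column
distinctdat <- data.frame(Index = 1:2, Variant = "Medium", Region = "WORLD",
                          Notes = "", Countrycode = c(900, 903),
                          "2015" = c("7 349 472", "1 194 370"),
                          check.names = FALSE)

# Unpivot, then slice off the first few rows to inspect them
melteddata <- melt(distinctdat,
                   id = c("Index", "Variant", "Region", "Notes", "Countrycode"))
slice(melteddata, 1:5)
```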
From the preceding example code output, we can see that the year
data now appears in the variable column. The population count data is now held in the value column.
Now it's possible to work with the value column in order to remove the spaces. Before we do that, let's tidy up the column names so that further code will be more readable.
Since we looked at rename earlier, let's look at a different way of achieving the same thing, whilst showing off more of dplyr's functionality. We can add a new Year column to hold the year data. Also, we can add a new column called PopulationCount, which is a duplicate of the value column. We can do this simply by running the following commands to create the new columns:
melteddata$PopulationCount <- melteddata$value
melteddata$Year <- melteddata$variable
Then, we can use the select command in dplyr to select the rest of the columns, removing the value and variable columns:
melteddata <- select(melteddata, -value, -variable)
This piece of code means that all of the columns held in the melteddata
variable are selected except the value and variable columns, which are denoted with the minus in front of them. The remaining columns are assigned to the melteddata
variable. We can use the slice
command again to see what's contained in the melteddata
variable, and you can see an example output as follows:
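Putting the new-column and select steps together on a toy frame confirms which columns remain (a sketch; the column set is reduced here for brevity):

```r
library(dplyr)

# Toy melted frame with the variable/value columns produced by melt()
melteddata <- data.frame(Region = "WORLD", variable = "2015",
                         value = "7 349 472")

# Duplicate value and variable under friendlier names, then drop the originals
melteddata$PopulationCount <- melteddata$value
melteddata$Year <- melteddata$variable
melteddata <- select(melteddata, -value, -variable)

names(melteddata)   # only Region, PopulationCount, and Year remain
```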
Let's move forward to work with the PopulationCount
data, so that we can remove the spaces. In order to do that, we will use the gsub
pattern matching and replacement mechanism. The gsub
function replaces all occurrences of a particular pattern in a string.
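gsub can be tried in isolation on a single figure in the same space-separated format as the data:

```r
# Replace every occurrence of a space with nothing
gsub(" ", "", "7 349 472")   # "7349472"
```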
Before we use gsub, we will make sure that the data is stored as a factor. The following code takes care of the conversion:
melteddata$PopulationCount <- as.factor(melteddata$PopulationCount)
The as.factor function does the conversion. Factors can be more memory-efficient than plain strings when values repeat, and they can be used in statistical modeling, where they will be assigned the correct number of degrees of freedom. Factors are also very useful when we are working with graphics in Tableau, because they are easy for business users to understand and use in their own analyses. As before, the new data is assigned to the old variable name.
In the following piece of code, we will use gsub
for the substitution process. It will replace every instance of a space with nothing, thereby removing the spaces. Then, the changes will be assigned back to the PopulationCount
column. The code is given as follows:
melteddata$PopulationCount <- gsub(" ","",melteddata$PopulationCount)
Once the spaces have been removed, let's change the data type back to a numeric data type. Down the line, this means that the output is more likely to be recognized as a number by Tableau. It will be easier for users to work with the data if it is presented to them conveniently, so that they can start to get insights as soon as possible. Although Tableau makes it easy for business users to change data types, users prefer not to have any impedance in their way when working with data. The command is as follows:
melteddata$PopulationCount <- as.numeric(melteddata$PopulationCount)
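The whole round trip, on one made-up figure in the same format as the data, looks like this:

```r
x <- as.factor("7 349 472")   # stored as a factor, spaces included
x <- gsub(" ", "", x)         # gsub coerces the factor to character and strips the spaces
x <- as.numeric(x)            # back to a proper number: 7349472
```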
We can slice the data again, in order to see how it looks now:
Once we have fixed our data, we can look at other activities on the data, such as summarizing it. We can use the summarise command to collapse the data into summary values. In this example, we are summarizing the data so that we have the overall population mean. We have a variable called OverallPopulationmean, and it contains the overall count of rows, along with the mean of the population. Here is the example code:
OverallPopulationmean <- summarise(melteddata, count = n(), OverallPopulationmean = mean(PopulationCount, na.rm = TRUE))
The result is given here:
count OverallPopulationmean
1 5733 249386.9
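If we wanted one mean per region rather than a single overall figure, summarise combines with dplyr's group_by. A sketch on toy data (the regions and counts here are made up):

```r
library(dplyr)

# Toy stand-in for the cleaned data: two regions, numeric population counts
melteddata <- data.frame(Region = c("AFRICA", "AFRICA", "EUROPE"),
                         PopulationCount = c(100, 200, 50))

# group_by() before summarise() yields one summary row per Region
regionmeans <- summarise(group_by(melteddata, Region),
                         count = n(),
                         meanpop = mean(PopulationCount, na.rm = TRUE))
regionmeans
```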
However, this isn't an accurate picture, because we know that the data itself contains both summary and detail data. Let's write the data to a CSV file, and we can explore it further in Tableau:
write.csv(melteddata, "melteddata.csv")
When we import the cleansed data into Tableau, we can filter the data so that we can simply see the Region data. Here is an example worksheet:
We have nice, clean line charts, which convey the message of the data. According to the projections, there is a steep rise in Africa, with a slight fall-off in Europe.
What would happen if we hadn't cleansed the data? Well, let's give Tableau the dirty dataset, and let's see what happens.
We can import the CSV file straight from the World Data Bank website, and into Tableau. The first thing that we notice is that instead of having a nice Year dimension, all of the years appear as individual dimensions. Here is an example:
Unfortunately, this makes the data very difficult to navigate. The Tableau user would have to drag each year to the canvas. In the following example, let's drag across the years from 2015 to 2030 and see what happens:
We can see here that the results are almost impossible for Tableau to visualize. The data is difficult for Tableau to interpret, and this result isn't going to be a good experience for the end users.