The string fields are often the richest in data source as they include other fields with free text, but they can also have the lowest data quality. The string functions in Tableau Public empower you to perform splice
, trim
, find
, replace
, match
, reformat
, and concatenate
functions.
We will use several of these functions to clean up the floods data source, which was assembled manually and has several data quality issues.
The following are some of the most useful functions that we will use:
In the next few examples, as shown in the following screenshot, we will use the TRIM and REGEX functions to repair the quality of the most varied misspellings of several countries. It looks like many of the values in the Country Name source field have spaces on either side of the primary value, which is not great and it's something that we need to fix:
This is helpful. We now have fewer unique entries for country names than before, but bad characters are still present in the data-set.
There are two options for the replacement of characters in a string, namely REPLACE and REGEXP_REPLACE:
Regular expressions are efficient, and they require less hard-coding than the REPLACE function. This means that you will need to modify them less as your data set changes. This is important with data sets that you (or other people) are compiling because humans inevitably introduce errors. Anyone can fat-finger a country name when they're entering data.
The regular expression that we entered includes the string to search (Country source), the pattern to search, which is entered in quotation marks, and the replacement, which is blank, as shown in the following screenshot:
The pattern that we entered, which is also called a token, has several parts. This is shown in the following screenshot:
Everything within the quotation marks is a token. Working from the outside, the plus sign asks Tableau Public to match any of the expressions inside the brackets. Within the brackets, there are two expressions that need to be found, namely a period and a letter d
. These expressions are very different from each other; the period is a literal, which means that we want Tableau Public to find all the periods. The d
is a variable, and it means that we want Tableau Public to find numbers.
The two expressions, namely the period and the d
variable, are preceded by a backslash. This is an escape. The concept of escapes is beyond the scope of this book. However, it is critical to master them if you want to learn how to structure data because they tell programming applications where to break apart long strings of variable characters, such as URLs. In this case, they ask Tableau Public to look for exactly the expression that we have entered.
If you would like to master the advanced string functions of Tableau Public, which are portable to other programming languages and are a good investment of time in case you're planning on producing advanced analytics or would like to build a career in data science, check out Mark Jackson's (http://www.twitter.com/ugamarkj) blog post at http://ugamarkj.blogspot.com/2015/01/tableau-90-and-regular-expressions.html and Joshua Milligan's (http://www.twitter.com/vizpainter) post at http://vizpainter.com/my-favorite-tableau-9-0-feature/. Mark and Joshua are both Tableau Zen Masters. It's a coincidence that the search pattern in our example matches that of Mark's. In line with full disclosure, Joshua Milligan is one of the reviewers of this book and an author who has worked with Packt Publishing on a book on Tableau as well.