The string functions

The string fields are often the richest in data source as they include other fields with free text, but they can also have the lowest data quality. The string functions in Tableau Public empower you to perform splice, trim, find, replace, match, reformat, and concatenate functions.

We will use several of these functions to clean up the floods data source, which was assembled manually and has several data quality issues.

The following are some of the most useful functions that we will use:

  • CONTAINS: This has a Boolean output. It tests whether a field contains the specified string.
  • FIND: This finds the place where a string of characters is located within a field.
  • LEFT, MID, and RIGHT: These often work in conjunction with FIND and LEN when extracting fixed or variable strings of characters from a field.
  • LEN: This produces the length of a field.
  • MIN and MAX: These are commonly used on numeric fields. They produce the numerically minimum or maximum values in sequences respectively.
  • The REGEX expressions: These features were introduced in Tableau Public 9.0. They extract, match, and replace variable strings within fields and are similar to the LIKE function in ANSI SQL.
  • REPLACE: This replaces a sequence with a specified value.
  • TRIM, LTRIM, and RTRIM: These trim leading or lagging spaces from a string.
  • UPPER and LOWER: These are commonly used to normalize the contents of a field.

    Note

    The Tableau Public does not have the PROPER function that both ANSI SQL and Microsoft Excel use to impose proper capitalization on inconsistently capitalized fields.

In the next few examples, as shown in the following screenshot, we will use the TRIM and REGEX functions to repair the quality of the most varied misspellings of several countries. It looks like many of the values in the Country Name source field have spaces on either side of the primary value, which is not great and it's something that we need to fix:

The string functions

This is helpful. We now have fewer unique entries for country names than before, but bad characters are still present in the data-set.

There are two options for the replacement of characters in a string, namely REPLACE and REGEXP_REPLACE:

  • REPLACE: This requires you to enter the string to be searched, the exact pattern to be replaced, and the replacement value. The advantage of using REPLACE is that it is easy to learn and execute. The disadvantage is that Tableau Public can search and replace only one string at a time.
  • REGEXP_REPLACE: This is a robust function, and while it takes practice to master it, this is widely used in different programming languages. It is a type of regular expression. Among the technical users of Tableau Public, the introduction of regular expressions in 9.x was highly anticipated because the search pattern allows a high degree of variability. You can enter a specific letter, number, or special character as the pattern that you need to search, or you can tell Tableau Public to find letters, numbers, or a variety of special characters. You can also ask Tableau Public to find a combination of these.

Regular expressions are efficient, and they require less hard-coding than the REPLACE function. This means that you will need to modify them less as your data set changes. This is important with data sets that you (or other people) are compiling because humans inevitably introduce errors. Anyone can fat-finger a country name when they're entering data.

The regular expression that we entered includes the string to search (Country source), the pattern to search, which is entered in quotation marks, and the replacement, which is blank, as shown in the following screenshot:

The string functions

The pattern that we entered, which is also called a token, has several parts. This is shown in the following screenshot:

The string functions

Everything within the quotation marks is a token. Working from the outside, the plus sign asks Tableau Public to match any of the expressions inside the brackets. Within the brackets, there are two expressions that need to be found, namely a period and a letter d. These expressions are very different from each other; the period is a literal, which means that we want Tableau Public to find all the periods. The d is a variable, and it means that we want Tableau Public to find numbers.

The two expressions, namely the period and the d variable, are preceded by a backslash. This is an escape. The concept of escapes is beyond the scope of this book. However, it is critical to master them if you want to learn how to structure data because they tell programming applications where to break apart long strings of variable characters, such as URLs. In this case, they ask Tableau Public to look for exactly the expression that we have entered.

If you would like to master the advanced string functions of Tableau Public, which are portable to other programming languages and are a good investment of time in case you're planning on producing advanced analytics or would like to build a career in data science, check out Mark Jackson's (http://www.twitter.com/ugamarkj) blog post at http://ugamarkj.blogspot.com/2015/01/tableau-90-and-regular-expressions.html and Joshua Milligan's (http://www.twitter.com/vizpainter) post at http://vizpainter.com/my-favorite-tableau-9-0-feature/. Mark and Joshua are both Tableau Zen Masters. It's a coincidence that the search pattern in our example matches that of Mark's. In line with full disclosure, Joshua Milligan is one of the reviewers of this book and an author who has worked with Packt Publishing on a book on Tableau as well.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset