Blending data sources

If you want to create a calculated field that produces the percentage of the total land flooded during an event, we need to use a data source that has this information in it. Most stories need more than one data source. When you are adding in secondary data sources, it's important to know what the level of aggregation is if you have joined data sources with different levels of aggregation and have not accounted for that when blending them because if you fail to do so, you might get the wrong results in your calculated fields.

The flood data is aggregated at the event level. This means that for every major flood, we have summarized data—the start date, end date, location, total land area affected, and so on. This is moderately granular data. If it were even more summarized, it might be summed up as floods per year per country.

We want to know the total percent of land flooded each year, and for this, we will assume that the area of each country is static throughout time. So, we created a new data source. We filtered the World Bank indicators data (available at http://data.worldbank.org/data-catalog/world-development-indicators/) in this data source to include only the indicators pertaining to the land area or population. The World Bank indicator data is aggregated by country by year, which is not the same level of aggregation as that of the data on floods; it is higher. So, in our calculated field, which will sum the total area flooded and then divide it by the total area of each country, we will need to tell Tableau how to aggregate the total area of each country. Otherwise, we will get the wrong results.

First, we need to join the data sources. In the companion workbook, which can be found at [X], we have a data source called Population and Land Data. This is the version of the World Bank indicators that we have filtered. It has only four dimensions, in this case, we have three, Country, Indicator Name, and Year. The data source on floods has Country Name and several other date fields. We want to join on Country name, as shown in the following screenshot:

Blending data sources

If we look at the Population and Land Data source, we do not see any of the gray chain links next to our dimensions that indicate that the fields are linked. There's a reason behind this. The names of the fields are not exactly the same. In order for Tableau Public to identify the join conditions automatically, the field names need to be exactly the same, with both capitalization, punctuation, and content. The field types need to be exactly the same as well.

From the Data menu, we can create a join on Country name in the Floods, as follows:

  1. Click on Edit Relationships.
  2. In the Edit Relationships dialog box, which is shown in the following screenshot, ensure that the appropriate data source is selected as the primary data source, which is the data source from which the first field is added to the visualization.
  3. Select the appropriate secondary data source.
  4. Click on the Custom radio button in case it is not selected by default. In this case, there are no automatic joins. Therefore, Tableau Public assumes that we need custom joins.
  5. Click on Add:
    Blending data sources

Next, we need to tell Tableau Public that the Country Name field in the major global floods data, which was created to remediate data quality issues, is the same as the Country field in the major global floods data source, as shown in the following screenshot:

Blending data sources

You can create only one custom join at a time. So for instance, if you would like to add more joins, such as a date field, you need to add each one individually by clicking on the OK button and then clicking on the Add button again, which we did in order to add a second join on the year of the event.

Note the small arrow next to the Began field, which is above Country Name. This field allows you to establish a join condition on a specific date part. Since we also wanted to join the data sources on the year of the flood event, we expanded the list to see the appropriate dateparts, clicked on Year, and then selected Year in the secondary data source.

We clicked on OK again to return to the visualization, which is still the map that we created that shows the average flooded land.

The view is now different. Looking at the Data window for the Population and Land Data source, you will see that that there are two chain links, the chain link for Country, and the following is orange. Because the linking field is a part of the visualization, you will see that Country Name is on the Detail shelf on the Marks card. However, Year has a gray, broken chain link because it is not a part of the visualization. If you want it to be used in the calculated fields, click on the gray chain link to activate the join. Even though the field might not be on the visualization, it still be a part of calculations using fields from its data source, as shown in the following screenshot:

Blending data sources

We can create a calculated field that produces the average amount of land in each country that was flooded. In this case, since some countries might have had major floods that add up to more than 100 percent of the total flooded land, we will create the maximum percent of land flooded, as shown in the following screenshot:

Blending data sources

In order to create this calculation, perform the following steps:

  1. Create a new calculated field in the major global floods data source.

    Note

    It is important that the place where we create the new calculated field determines the data source that is the primary data source.

  2. Name the field.
  3. Add Affected sq km to the numerator by dragging it into the dialog box.
  4. Wrap it in a max function.
  5. Add a division sign (/) because we want to divide it by the total area of the country.
  6. Click on the Population and Land Data source and drag Area of Country to the denominator.
  7. The denominator is automatically summed. You can change the sum to whatever aggregation you would like, but if you do not aggregate the fields that are from the secondary data sources, you will get error messages.
  8. Change the aggregation of the field from the secondary data source, which is the field whose name is preceded by its data source name. This is similar to a SQL query that is used to find out an average. The reason behind why we did this is that for every record in the partition, Tableau Public needs to know what to do with the denominator. If we sum it, we will not have the proper calculation.

Check out the correct calculated field, It's important to learn how to handle errors. The next two errors are very common and are as follows:

  • The following screenshot shows that we cannot mix aggregate and non-aggregate arguments with this function, with the division function highlighted. This means that in a calculated field, both arguments must be aggregated. We can resolve this by aggregating the numerator or by using the ATTR function on it. ATTR tells Tableau to use the discrete value for it:
    Blending data sources
  • All fields must be aggregate or constant when using table calculation functions or fields from multiple data sources. This means that the denominator, which is from a secondary data source, must be aggregated. Tableau Public requires this so that it knows how to aggregate. We resolved this by placing the AVG function around the denominator and adding an aggregation to the denominator:
    Blending data sources
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset