Extending Watson

We've already seen how IBM Watson can extend the value of your data analytics efforts through easy exploration and visualization. It does this by interpreting your typed questions and providing interactive visualizations based on your words. In the preceding section, we looked at ways in which you can customize your Watson experience. Now let's consider the prospects for extending Watson.

With all that Watson can do, it is still dependent on data quality. In other words, the better the data, the better the results.

Data quality

When you hear the term "data quality" in conjunction with Watson Analytics, it refers to how suitable a given set of data is for analysis. In other words, how reasonable might the results be if one performs an analysis with the data file?

In Chapter 2, Identifying Use Cases, we briefly mentioned that Watson assigns a data quality score to each dataset when you upload it. This score (from 0 to 100) is an assessment of the data's quality based on a computed average of the data quality of each field or column in the data file.
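The averaging step can be pictured with a few lines of Python. This is only a sketch: we assume a plain unweighted mean, which the text does not confirm, and the field names and scores are hypothetical.

```python
from statistics import mean

# Hypothetical per-field scores (0 to 100); names and values are illustrative only.
field_scores = {
    "field_a": 97,
    "field_b": 0,
    "field_c": 83,
}

def file_quality_score(scores):
    """Return the unweighted mean of the per-field scores, rounded to an integer.

    Assumption: Watson's "computed average" is treated here as a simple mean.
    """
    return round(mean(scores.values()))

print(file_quality_score(field_scores))  # 60
```

Note how a single field scoring zero pulls the overall score down sharply, even when the other fields score well.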

To compute the data quality score for a field in a data file, Watson considers the following factors:

  • Missing values: Records or transactions that have no data entered in a particular field will affect the data's quality score. The more records with missing values, the lower the data quality score.
  • Constant values: Some fields in your file may have the same value for every record, making the data quality score for these fields zero.
  • Imbalance: An imbalance condition can occur if fields in a data file are defined as categorical and records are not equally distributed across the defined categories; this is known as unequal frequency. The more unequal the frequencies, the lower the data quality.
  • Influential categories: Influential categories are categories that are significantly different from other categories and therefore have more influence over the field. As the influence of these categories increases, the data quality decreases.
  • Outliers: Outliers are extreme values found in a field (as compared to the average values of the field). The more outliers there are, the lower the data quality.
  • Skewness: Skewness measures how symmetrically a continuous field is distributed. Skewed fields have lower data quality scores.
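Watson's exact formulas are not given here, but the three distribution-based factors (imbalance, outliers, and skewness) can be approximated with standard statistics. The following is a rough sketch using only the Python standard library. The function names are our own, and the 1.5 × IQR outlier rule and the population skewness formula are conventional choices, not necessarily what Watson uses.

```python
from collections import Counter
from statistics import mean, pstdev, quantiles

def imbalance(values):
    """Share of records in the most frequent category.

    For k categories, 1/k means perfectly balanced; values near 1 mean
    one category dominates.
    """
    counts = Counter(values)
    return max(counts.values()) / len(values)

def outlier_count(values):
    """Count values more than 1.5 * IQR beyond the quartiles (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return sum(1 for v in values if v < low or v > high)

def skewness(values):
    """Population skewness: the mean of the cubed z-scores (0 for symmetric data)."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return 0.0  # a constant field has no spread, so no skew to measure
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)

# Illustrative values only.
print(imbalance(["a", "a", "a", "b"]))
print(outlier_count([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
print(skewness([1, 1, 1, 10]) > 0)
```

Running checks like these before uploading a file can hint at which fields are likely to receive a low score.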

Watson data metrics

In Chapter 2, Identifying Use Cases, we pointed out that Watson provides you with the ability to both explore and refine your data file. Recall that you can review and tune your data file to match the way you want to see or work with it, and any changes that you make are saved as a separate version of the original data file.

Let's look again at the Historic_Stadium_Sales file that we previously loaded into Watson:

[Screenshot: the Historic_Stadium_Sales file in Watson]

Click on the file (as shown in the preceding screenshot) and then click on Refine, as shown here:

[Screenshot: the Refine option]

With our file displayed on the refine page (shown next), click on the Data Metrics icon:

[Screenshot: the Data Metrics icon on the refine page]

Watson then displays the data quality for each column of data:

[Screenshot: the data quality score for each column]

If you look closely at the Gameday Weather column, you'll see that the data quality score is zero:

[Screenshot: the Gameday Weather column with a data quality score of zero]

This is because the column holds the same value for every record (Sunny is the only value found). As we discussed earlier in this section, a constant value lowers the data quality of the column and ultimately the data quality of the entire file.

Here is another example of data quality. In the same file, Historic_Stadium_Sales, the column named Payment Method has a data quality score of 97, as shown in this screenshot:

[Screenshot: the Payment Method column with a data quality score of 97]

In a new version of this file that we uploaded, Historic_Stadium_Sales_MissingValues, the same column has a much lower data quality score of 71. If you look at the following screenshot, you will see that Watson has found that 3% of the records in this file have no value (they are missing) for the column:

[Screenshot: the Payment Method column with a data quality score of 71 and 3% missing values]
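Both of the problems seen above, a constant column and missing values, can be spotted before a file is uploaded. As a minimal sketch, assuming the data is a CSV file with a header row, the following Python reports the percentage of missing values and a constant-value flag for each column. The `column_report` helper is hypothetical, and the sample rows are made up for illustration; they are not taken from Historic_Stadium_Sales.

```python
import csv
import io

def column_report(csv_text):
    """Return {column: (missing_percent, is_constant)} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    if not rows:
        return {}
    report = {}
    for column in reader.fieldnames:
        # Treat empty strings and absent cells as missing.
        values = [(row[column] or "").strip() for row in rows]
        present = [v for v in values if v]
        missing_percent = 100 * (len(values) - len(present)) / len(values)
        is_constant = len(set(present)) <= 1
        report[column] = (missing_percent, is_constant)
    return report

# Made-up sample rows for illustration only.
sample = """Gameday Weather,Payment Method
Sunny,Cash
Sunny,
Sunny,Credit
Sunny,Cash
"""

for column, (missing, constant) in column_report(sample).items():
    print(f"{column}: {missing:.0f}% missing, constant={constant}")
```

A column flagged as constant would be expected to score zero in Watson, while a column with missing values would be expected to score lower as the missing percentage grows.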