In this chapter, we will talk about the practice of content analysis and how IBM Watson Analytics can be used as a tool to help analyze big data.
This chapter is organized in the following way:
Content Analytics focuses on textual data, which is typically difficult to analyze. Because of its unformatted and ambiguous nature, potential insights hidden within this data often go unrealized. Commonly known automated methods for understanding this type of data are difficult at best, leaving a potentially large body of opportunity unconsidered when making key decisions. Previously, attempts to include this data in decision making depended on great manual effort.
For example, by reading customer satisfaction surveys one by one, an organization may gain an understanding of what individual customers are thinking, but may not know whether satisfied customers are unique or common cases, or whether such cases are increasing or decreasing. This kind of understanding can be acquired only by analyzing the data set as a whole. However, textual data generally contains a huge variety of information, some valuable and some not, making it important to focus on only the valuable information.
Content Analytics is designed to help with this. It allows efficient analysis of entire data sets to find patterns and trends, and supports the identification of those particular patterns and trends that are important to you. This allows you to reach greater levels of awareness and achievement with your data.
It shouldn't take much effort to convince you that manually evaluating significant amounts of textual data is problematic and expensive.
The generally accepted process of analyzing data consists of:
This sounds simple, but problems occur:
During manual analysis of textual data, analysts usually select a more manageable (but perhaps not representative) subset of the data at random, and can misclassify records into groups based on ambiguous understandings.
The idea of using Content Analytics is to use an automated means to analyze your textual data, easily processing thousands more records than a manual process ever could. The objective is to change the analyst from a document classifier and chart maker to an interpreter of the analytical results.
Since textual data consists of words and phrases written for and intended for humans rather than machines, it is challenging for automated processing to always interpret it correctly. For example:
You can see that the preceding challenges (and others) might cause the analytical results to not always match with your expectations.
Humans performing manual analysis are the best way to improve the accuracy of your text analytics process, but this becomes impractical as the volume of your textual data increases. One might suggest adding more people to the process, but different people (in fact, even the same analyst) can produce completely different results over time.
The way that Content Analytics works (using definitions and pattern matching rules) makes it a much more productive process than the manual alternative. In addition, it is important to understand that Content Analytics keeps the criteria for interpretation the same across the entire data set. This allows analysts to spend more time analyzing results and taking action based upon examination of all of the data, even in large amounts, which leads to a higher-value result.
Another area where Content Analytics can improve the analytical process and ultimately the results is with dealing with frequency and deviation.
Frequency is the number of times that certain keywords are identified within unique responses or records. Deviation refers to changes in responses (or in the number of occurrences of those keywords) over time.
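To make these two measures concrete, the following sketch counts keyword frequency and its period-to-period change over a small set of hypothetical survey responses. The records, quarters, and keyword are all invented for illustration; Content Analytics performs this kind of counting internally, not through code you write.

```python
from collections import Counter

# Hypothetical survey responses, each tagged with the quarter it was received in.
records = [
    ("2023-Q1", "shipping was slow and support unhelpful"),
    ("2023-Q1", "great product, slow shipping though"),
    ("2023-Q2", "slow shipping again"),
    ("2023-Q2", "product quality is great"),
]

keyword = "slow"

# Frequency: how many unique records mention the keyword at all.
frequency = sum(keyword in text for _, text in records)

# Deviation: how the number of matching records changes from period to period.
per_period = Counter(period for period, text in records if keyword in text)

print(frequency)         # 3
print(dict(per_period))  # {'2023-Q1': 2, '2023-Q2': 1}
```

Here the keyword "slow" appears in three of four records overall, but the per-quarter counts show the mentions falling from two to one, which is the deviation an analyst would want to track.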
Without considering the entire data set and identifying and understanding relationships within the data, the results produced may still not reflect reality.
Again, Content Analytics in Watson Analytics ensures a much broader analysis of all of the data, discovering new relationships and providing more reliable insights.
Finally, results based upon large amounts of textual data depend on two factors: precision (accuracy) and recall (coverage). Precision is defined as the ratio of correctly returned results (true positives) to the total number of returned results. Recall is defined as the ratio of correctly returned results to the total number of correct results in the data set.
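These two definitions can be computed directly. In this sketch, the record identifiers and the hand-labeled "relevant" set are invented for illustration; the ratios themselves follow the standard definitions above.

```python
# Hypothetical: records an automated classifier returned as matches,
# compared against a small hand-labeled set of truly relevant records.
returned = {"r1", "r2", "r3", "r4"}        # results the system returned
relevant = {"r2", "r3", "r5", "r6", "r7"}  # all correct results in the data set

true_positives = returned & relevant       # correctly returned results

precision = len(true_positives) / len(returned)  # 2 / 4 = 0.5
recall = len(true_positives) / len(relevant)     # 2 / 5 = 0.4
```

Note how the same two true positives yield different scores: the system is penalized on precision for the two incorrect results it returned, and on recall for the three relevant records it missed.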
In general, aiming for higher recall is more challenging than aiming for higher precision. To achieve higher precision, the goal is noise elimination; for higher recall, the idea is to search extensively to be sure to capture all relevant information.
Dealing with precision and recall is common in Content Analytics. Tools like IBM SPSS offer options for preprocessing your data to improve precision and recall. In Watson Analytics, as we mentioned in an earlier chapter, you can use the Refine feature to help with this.