Chapter 4. Understanding Content Analysis

In this chapter, we will talk about the practice of content analysis and how IBM Watson Analytics can be used as a tool to help analyze big data.

This chapter is organized in the following way:

  • Basic concepts of Content Analytics
  • Cycle of analysis with Content Analytics using Watson Analytics
  • An illustrative use case

Basic concepts of Content Analytics

Content Analytics focuses on textual data that is typically difficult to analyze. Because this data is unformatted and ambiguous, the insights hidden within it are often never realized. Understanding this type of data with commonly known automated methods is difficult at best, which leaves a potentially large body of opportunity unconsidered when key decisions are made. Previously, attempts to include this data in decision making depended on great manual effort.

For example, by reading customer satisfaction surveys one by one, an organization may gain an understanding of what individual customers are thinking, but may not know whether satisfied customers are unique cases or common cases, or whether such cases are increasing or decreasing. This kind of understanding can be acquired only by analyzing the data set as a whole. However, textual data generally contains a huge variety of information (some valuable and some not), making it important to focus on only the valuable information.

Content Analytics is designed to help with this. It allows efficient analysis of entire data sets to find patterns and trends, and supports the identification of those particular patterns and trends that are important to you. This allows you to reach greater levels of awareness and achievement with your data.

Manual or automated

We shouldn't have to take too much time to convince you that trying to manually evaluate significant amounts of textual data is problematic and expensive.

The generally accepted process of analyzing data consists of:

  • Classifying the data into groups with predefined characteristics (also known as setting the objectives for the analysis)
  • Totaling the numbers found that fit each group
  • Presenting the totals
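
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how Watson Analytics works internally: the group names, the keyword lists, and the sample survey records are all hypothetical and were invented for this example.

```python
from collections import Counter

# Hypothetical groups and trigger keywords (illustrative assumptions only).
GROUPS = {
    "satisfied": {"great", "happy", "excellent"},
    "dissatisfied": {"slow", "broken", "disappointed"},
}


def classify(record):
    """Step 1: return the first group whose keywords appear in the record."""
    words = {w.strip(".,!?").lower() for w in record.split()}
    for group, keywords in GROUPS.items():
        if words & keywords:
            return group
    return "unclassified"


def tally(records):
    """Step 2: total the number of records that fall into each group."""
    return Counter(classify(r) for r in records)


def present(totals):
    """Step 3: present the totals as simple text lines, sorted by group name."""
    return [f"{group}: {count}" for group, count in sorted(totals.items())]


# Made-up survey responses.
survey = [
    "Great service, very happy.",
    "The app is slow and broken.",
    "Delivery arrived on Tuesday.",
    "Excellent support team.",
]
totals = tally(survey)
report = present(totals)
print("\n".join(report))
```

Note that the third record matches no group at all, which hints at the classification problem described next.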

This sounds simple, but problems occur:

  • The amount of data becomes large
  • Data is ambiguous and isn't easily classified into a predefined group

During manual analysis of textual data, analysts usually select a random, more manageable subset of the data (which may not be representative of the whole) and can misclassify data into groups based on ambiguous understandings.

The idea of using Content Analytics is to use an automated means to analyze your textual data, easily processing thousands more records than a manual process ever could. The objective is to change the analyst from a document classifier and chart maker to an interpreter of the analytical results.

Difficulties with textual analysis

Since textual data is, in fact, text (words and phrases written for and intended for humans rather than machines), it is challenging for automated processing to always interpret it correctly. For example:

  • A phrase can be interpreted in different ways
  • Some words can be either an adjective or a noun
  • Words may have different meanings, such as a name of a person or place
  • Words could be used as past tenses or as an object
  • Some key words in a phrase may affect the overall meaning of the phrase
  • Words and concepts may often be overly ambiguous
  • Misspellings often result in ambiguity
  • Order or timeliness may change the acceptable meaning of a term
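
To make a couple of these difficulties concrete, here is a small sketch of how a naive keyword matcher can be misled by a key word that changes the meaning of a phrase (a negation) and by a misspelling. The keyword sets and function names are hypothetical and exist only for this illustration; real text analytics tools use far richer linguistic rules.

```python
# Hypothetical keyword sets for a deliberately simple matcher.
POSITIVE = {"happy", "good", "satisfied"}
NEGATORS = {"not", "never", "no"}


def tokenize(text):
    """Split on whitespace, strip basic punctuation, and lowercase."""
    return [w.strip(".,!?").lower() for w in text.split()]


def naive_is_positive(text):
    """Flag the text as positive if any positive keyword appears anywhere."""
    return any(w in POSITIVE for w in tokenize(text))


def negation_aware_is_positive(text):
    """Ignore a positive keyword when the word just before it is a negator."""
    tokens = tokenize(text)
    for i, word in enumerate(tokens):
        if word in POSITIVE:
            if i > 0 and tokens[i - 1] in NEGATORS:
                continue
            return True
    return False


negated = "I am not happy with the service."
misspelled = "I am hapy with the service."

print(naive_is_positive(negated))           # the keyword alone misleads the matcher
print(negation_aware_is_positive(negated))  # one extra rule handles this case
print(naive_is_positive(misspelled))        # the misspelling hides the keyword
```

Even the improved rule only looks at the immediately preceding word, so a phrase such as "not at all happy" would still be misread, which shows why the analytical results do not always match expectations.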

You can see that the preceding challenges (and others) might cause the analytical results to differ from your expectations.

Having humans perform manual analysis is the best way to improve the accuracy of your text analytics process, but it becomes impractical as the volume of your textual data increases. One might suggest adding more people to the process, but different people (in fact, even the same analyst) can produce completely different results over time.

The way that Content Analytics works, using definitions and pattern matching rules, makes it a much more productive process than the manual alternative. In addition, it is important to understand that using Content Analytics keeps the criteria for interpretation the same for the entire data set. This allows analysts to spend more time analyzing results and perhaps taking action based on an examination of all of the data, even large amounts of it, which leads to a higher-value result.

Frequency and deviation

Another area where Content Analytics can improve the analytical process, and ultimately the results, is in dealing with frequency and deviation.

Frequency is the number of times that certain keywords are identified within unique responses or records. Deviation refers to the change or changes in responses (or in the number of occurrences of those keywords).
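
As a rough sketch of these two ideas, the following Python counts how many records mention a keyword (frequency) and then measures how that count changes from one period to the next (deviation). Treating deviation as a simple difference between consecutive periods is an assumption made for this example, and the quarterly responses are invented; Watson Analytics may calculate these measures differently.

```python
def frequency(records, keyword):
    """Count the records that mention the keyword at least once."""
    keyword = keyword.lower()
    return sum(
        1
        for record in records
        if keyword in (w.strip(".,!?").lower() for w in record.split())
    )


def deviation(freq_by_period):
    """Change in frequency from each period to the next (simple difference)."""
    periods = list(freq_by_period)
    return {
        later: freq_by_period[later] - freq_by_period[earlier]
        for earlier, later in zip(periods, periods[1:])
    }


# Made-up responses grouped by quarter.
responses = {
    "Q1": ["Refund took too long", "Fast delivery", "Asked for a refund twice"],
    "Q2": ["Refund denied", "Still waiting on refund", "Refund refund refund!", "Nice staff"],
}
freq = {period: frequency(recs, "refund") for period, recs in responses.items()}
change = deviation(freq)
print(freq)
print(change)
```

Because each record is counted once no matter how many times it repeats the keyword, one very emphatic response cannot distort the frequency.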

Without considering the entire data set and identifying and understanding relationships within the data, the results produced may still not reflect reality.

Again, Content Analytics with Watson Analytics ensures a much broader analysis of all of the data to discover new relationships and provide more reliable insights.

Precision and recall

Finally, results based on large amounts of textual data depend on two factors: precision (accuracy) and recall (coverage). Precision is defined as the ratio of correctly returned results (true positives) to the total number of returned results. Recall is defined as the ratio of correctly returned results to the total number of correct results in the data set.
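
The two ratios can be computed directly once you know which records were returned and which records are actually correct. The sketch below assumes records are identified by simple IDs; the function name and the sample IDs are hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned and relevant record IDs."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    # Precision: share of the returned results that are correct.
    precision = true_positives / len(returned) if returned else 0.0
    # Recall: share of all correct results that were returned.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


returned_ids = [1, 2, 3, 4]        # what the analysis returned
relevant_ids = [2, 3, 4, 5, 6, 7]  # what it should have returned
precision, recall = precision_recall(returned_ids, relevant_ids)
print(precision, recall)  # 0.75 0.5
```

Here three of the four returned records are correct (precision of 0.75), but only three of the six correct records were found (recall of 0.5).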

In general, aiming for higher recall is more challenging than aiming for higher precision. To achieve higher precision, the goal is noise elimination; for higher recall, the idea is to search extensively to be sure to capture all relevant information.

Dealing with precision and recall is common in Content Analytics. Tools like IBM SPSS offer options for preprocessing your data to improve precision and recall. In Watson Analytics, as we mentioned in an earlier chapter, you can use the Refine feature to help with this.
