Other quantitative analysis

We will now shift gears and analyze text semantically, working at the sentence level and tagging words by part of speech (noun, verb, pronoun, adjective, adverb, preposition, and so on) and by number (singular or plural). Often, just examining the frequency and latent topics in the text will suffice for your analysis. However, you may find occasions when a deeper understanding of the style is required in order to compare speakers or writers.

There are many methods to accomplish this task, but we will focus on the following five:

  • Polarity (sentiment analysis)
  • Automated readability index (complexity)
  • Formality
  • Diversity
  • Dispersion

Polarity is often referred to as sentiment analysis, which tells you how positive or negative the text is. When you analyze polarity in R, a score is assigned to each word, and you can then analyze the average and standard deviation of polarity by groups such as different authors, texts, or topics. Different polarity dictionaries are available and we will explore them in more detail later. You can alter a dictionary according to your requirements.

The algorithm works by first tagging the words with a positive, negative, or neutral sentiment based on the dictionary. The tagged words are then clustered based on the four words before and the two words after a tagged word, and the words in these clusters are tagged as what are known as valence shifters (neutral, negator, amplifier, and de-amplifier). A series of weights based on their number and position is applied to both the words and the clusters. The weighted values are then summed and divided by the square root of the number of words in that sentence.
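To make the steps above concrete, here is a minimal sketch in Python. It is not the qdap implementation: the tiny word lists are hypothetical stand-ins for a real polarity dictionary, the function name sentence_polarity is made up for illustration, and the shifter weight of 0.8 is an assumption. It only shows the general shape of the calculation: tag a word, look at the four words before and two words after it, let negators flip the sign and amplifiers or de-amplifiers adjust the weight, then divide the sum by the square root of the sentence's word count.

```python
import math
import re

# Tiny illustrative lexicons (hypothetical stand-ins for a real dictionary).
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "awful", "sad"}
NEGATORS = {"not", "never", "no"}
AMPLIFIERS = {"very", "really"}
DEAMPLIFIERS = {"barely", "slightly"}


def sentence_polarity(sentence, n_before=4, n_after=2, shift_weight=0.8):
    """Simplified polarity score for one sentence (illustrative only)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0

    total = 0.0
    for i, word in enumerate(words):
        # +1 for a positive word, -1 for a negative word, 0 otherwise.
        base = (word in POSITIVE) - (word in NEGATIVE)
        if base == 0:
            continue

        # Context cluster: words before and after the tagged word.
        cluster = words[max(0, i - n_before):i] + words[i + 1:i + 1 + n_after]
        negations = sum(w in NEGATORS for w in cluster)
        amps = sum(w in AMPLIFIERS for w in cluster)
        deamps = sum(w in DEAMPLIFIERS for w in cluster)

        # Amplifiers raise the weight, de-amplifiers lower it,
        # and each negator flips the sign.
        weight = 1 + shift_weight * (amps - deamps)
        total += base * weight * (-1) ** negations

    # Scale by the square root of the number of words in the sentence.
    return total / math.sqrt(len(words))


print(sentence_polarity("The food was very good"))
print(sentence_polarity("The food was not good"))
```

The first sentence scores higher than a plain "good" would because of the amplifier, while the negator in the second sentence flips the score below zero.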

The automated readability index is a measure of text complexity and of how easily a reader can understand the text. A specific formula is used to calculate this index: 4.71(# of characters / # of words) + 0.5(# of words / # of sentences) - 21.43.

The index produces a number that is a rough estimate of the grade level a student needs in order to fully comprehend the text. If the number is 9, then a high school freshman, aged 13 to 15, should be able to grasp the meaning of the text.
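As a quick illustration of the formula, the following Python sketch computes the index for a short string. The tokenization rules are my own simplifying assumptions (sentences end at ., !, or ?, and only letters and digits count as characters), so the result may differ slightly from what qdap reports for the same text.

```python
import re


def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    # Assumption: a sentence ends at one or more of . ! ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    # Count only letters and digits as characters.
    characters = sum(len(re.sub(r"[^A-Za-z0-9]", "", w)) for w in words)

    return (4.71 * characters / len(words)
            + 0.5 * len(words) / len(sentences)
            - 21.43)


# 18 characters, 6 words, 2 sentences: 4.71 * 3 + 0.5 * 3 - 21.43
print(round(automated_readability_index("The cat sat. The dog ran."), 2))
```

Very short, simple sentences such as these produce a score below the first grade level, which is why the index is best read as a rough estimate.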

The formality measure provides an understanding of how a text relates to the reader or how speech relates to a listener. I like to think of it as a way to understand how comfortable the person producing the text is with the audience, or as an understanding of the setting where this communication takes place. If you want to experience formal text, attend a medical conference or read a legal document. Informal text is said to be contextual in nature.

The formality measure is called F-Measure. This measure is calculated as follows:

  • Formal words (f) are nouns, adjectives, prepositions, and articles
  • Contextual words (c) are pronouns, verbs, adverbs, and interjections
  • N = sum of (f + c + conjunctions)
  • Formality Index = 50(((sum of f - sum of c) / N) + 1)
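The calculation can be sketched in a few lines of Python, reading the index as 50 × ((f - c) / N + 1). The sketch assumes the part-of-speech counts have already been produced by a tagger; the function name formality_index, the dictionary keys, and the counts are all hypothetical and only serve the example.

```python
def formality_index(pos_counts):
    """Formality index from part-of-speech counts (illustrative sketch)."""
    formal_tags = ("noun", "adjective", "preposition", "article")
    contextual_tags = ("pronoun", "verb", "adverb", "interjection")

    f = sum(pos_counts.get(tag, 0) for tag in formal_tags)
    c = sum(pos_counts.get(tag, 0) for tag in contextual_tags)
    n = f + c + pos_counts.get("conjunction", 0)
    if n == 0:
        raise ValueError("no tagged words to score")

    # 50 when formal and contextual words balance; 100 when all are formal.
    return 50 * ((f - c) / n + 1)


# Hypothetical counts for a 100-word passage.
counts = {
    "noun": 30, "adjective": 10, "preposition": 12, "article": 8,
    "pronoun": 10, "verb": 15, "adverb": 5, "interjection": 0,
    "conjunction": 10,
}
print(formality_index(counts))
```

With 60 formal words, 30 contextual words, and 10 conjunctions, the index comes out at about 65, which leans toward the formal end of the 0 to 100 scale.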

Diversity, as it relates to text mining, refers to the number of different words used in relation to the total number of words used. This can also mean the expanse of the text producer's vocabulary or lexical richness. The qdap package provides five (that's right, five) different measures of diversity: simpson, shannon, collision, berger_parker, and brillouin. I won't cover these five in detail but will only say that the algorithms are used not only in communication and information science but also to measure biodiversity in nature.
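To give a feel for what such measures compute, here is a small Python sketch of three of them using their common textbook definitions from ecology: Shannon entropy, a Simpson index, and the Berger-Parker index. This is an assumption-laden illustration rather than qdap's code, and the exact variants that qdap implements may differ; the function name diversity_measures is made up for the example.

```python
import math
from collections import Counter


def diversity_measures(words):
    """Textbook diversity indices over a list of words (illustrative)."""
    counts = Counter(words)
    total = sum(counts.values())
    if total < 2:
        raise ValueError("need at least two words")

    proportions = [n / total for n in counts.values()]

    # Shannon: entropy of the word distribution (natural log).
    shannon = -sum(p * math.log(p) for p in proportions)
    # Simpson: chance that two words drawn without replacement differ.
    simpson = 1 - sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))
    # Berger-Parker: share of the text taken by the most frequent word.
    berger_parker = max(counts.values()) / total

    return {"shannon": shannon, "simpson": simpson, "berger_parker": berger_parker}


print(diversity_measures("the cat saw the dog and the bird".split()))
```

A text that repeats a few words heavily has a high Berger-Parker value and lower Shannon and Simpson values, while a text with a varied vocabulary shows the opposite pattern.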

Finally, dispersion, or lexical dispersion, is a useful tool for understanding how words are spread throughout a document, and it serves as an excellent way to explore text and identify patterns. The analysis is conducted by specifying the word or words of interest, which are then shown in a plot of where each word occurred in the text over time. As we will see, the qdap package has a built-in plotting function to analyze text dispersion.
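The data behind a dispersion plot is simply the position of each occurrence of the words of interest. The Python sketch below collects those positions; it is an illustration of the idea under a simple tokenization assumption, not qdap's plotting function, and the name term_positions is hypothetical. Plotting the positions along a horizontal axis, one row per term, would give the dispersion plot itself.

```python
import re


def term_positions(text, terms):
    """Map each term to the word offsets at which it occurs in the text."""
    # Assumption: lowercase the text and treat runs of letters as words.
    words = re.findall(r"[a-z']+", text.lower())
    wanted = {term.lower() for term in terms}

    positions = {term: [] for term in wanted}
    for index, word in enumerate(words):
        if word in wanted:
            positions[word].append(index)
    return positions


speech = "Jobs and the economy matter. We need jobs, and we need a strong economy."
print(term_positions(speech, ["jobs", "economy"]))
```

Here "jobs" appears at word offsets 0 and 7 and "economy" at 3 and 13, so both terms recur in the second sentence.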

We have covered a framework for text mining: how to prepare the text, count words, and create topic models and, finally, we dived deep into other lexical measures. Now, let's apply all this and do some real-world text mining.
