Looking for context in text – analyzing document n-grams

What was the main limitation of our wordclouds? As we said, the absence of context. In other words, we were looking at isolated words, which carry little meaning beyond what each word conveys on its own.

This is where n-gram analysis techniques come in. These techniques involve tokenizing the text into groups of words rather than into single words. These groups of words are called n-grams.
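To make this concrete, here is a minimal sketch of bigram tokenization on a made-up one-sentence data frame (the sentence and the toy tibble are purely illustrative, not part of our dataset):

library(dplyr)
library(tidytext)

# a made-up single-comment data frame, just to show what bigrams look like
toy <- tibble(text = "the contact person never answers the phone")

toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
# one row per bigram: "the contact", "contact person", "person never",
# "never answers", "answers the", "the phone"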

We can obtain n-grams from our comments dataset by simply applying the unnest_tokens function again, but this time passing "ngrams" as the value of the token argument and 2 as the value of the n argument:

# tokenize each comment into overlapping two-word sequences
comments %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) -> bigram_comments

Since we specified 2 as the value for the n argument, we are extracting what are called bigrams: pairs of words that appear adjacent to each other in a text. Let's now count the frequency of our bigrams. We also remove the stop_words from our dataset:

library(tidyr)  # for separate()

bigram_comments %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%  # split each bigram into its two words
  filter(!word1 %in% stop_words$word) %>%               # drop bigrams starting with a stop word
  filter(!word2 %in% stop_words$word) %>%               # drop bigrams ending with a stop word
  count(word1, word2, sort = TRUE)

Since we specified sort = TRUE within the count function, what we are looking at is a list of bigrams sorted by their frequency in the whole dataset.
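If you prefer a visual check, here is a hedged sketch of how the most frequent bigrams could be plotted with ggplot2 (the cut-off of ten bigrams and the axis labels are arbitrary choices, not part of the original analysis):

library(ggplot2)

bigram_comments %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  unite(bigram, word1, word2, sep = " ") %>%  # glue the two words back together for labelling
  slice_max(n, n = 10) %>%                    # keep the ten most frequent bigrams
  ggplot(aes(n, reorder(bigram, n))) +
  geom_col() +
  labs(x = "frequency", y = NULL)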

What can you conclude from this? Some messages are now starting to emerge:

  • Our colleagues are complaining about unjustified discounts being granted to these companies
  • There are problems with the relevant contact persons
  • There are problems with payments related to this company

Is this enough? What if we now perform the same kind of analysis, employing three words rather than two? Let's try this, moving from bigrams to trigrams:

# tokenize each comment into overlapping three-word sequences
comments %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) -> trigram_comments

We now have a group of three words for each line. Let's split this into three columns, named word1, word2, and word3, and leave the rest as before in order to obtain the frequency count:

trigram_comments %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)
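One caveat worth flagging: depending on your tidytext version, comments shorter than three words produce NA instead of a trigram, and those NA rows can bubble up to the top of the count. Here is a small sketch of the extra filter you may want to add (the same pipeline as above, with one added line):

trigram_comments %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
  filter(!is.na(word1)) %>%  # drop comments too short to form a trigram
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word3 %in% stop_words$word) %>%
  count(word1, word2, word3, sort = TRUE)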

This is definitely adding context. We now know that a relevant and common problem relates to contact persons not being easily reachable and to payments being delayed. I think we can consider our analysis of the unstructured part of the text concluded. Let's move to the information data frame to see whether the information contained within that dataset can further highlight the commonalities between these companies.
