Developing wordclouds from text

We can make a first attempt at visualizing these words using the wordcloud package, which does exactly what its name suggests: it draws wordclouds.

To create a wordcloud, we just have to call the wordcloud() function, which requires two arguments:

  • words: The words to be plotted
  • frequency: The number of occurrences of each word

Let's do it:

library(wordcloud)

comments_tidy %>%
  count(word) %>%
  with(wordcloud(word, n))
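Beyond the two required arguments, wordcloud() accepts several optional ones that are often useful in practice. A sketch of the same pipeline with a few of them (max.words, min.freq, and random.order are part of the wordcloud package's API; the specific values here are just illustrative):

```r
library(dplyr)
library(wordcloud)

comments_tidy %>%
  count(word) %>%
  with(wordcloud(word, n,
                 max.words = 100,        # plot at most 100 words
                 min.freq = 2,           # drop words that occur only once
                 random.order = FALSE))  # place the most frequent words centrally
```

Setting random.order = FALSE tends to produce more readable clouds, since the highest-frequency words end up in the middle of the plot.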

The plot shows all the words stored in the comments_tidy object, each sized in proportion to its frequency. You should also be aware that the position of each word has no particular meaning here.

What do you think about it? Not bad, is it? Nevertheless, I can see too many irrelevant words, such as we and with. These words do not convey any useful information about the content of the comments, and because they are so frequent, they obscure other, more meaningful words.

We should therefore remove them, but how? The most popular way is to compare the analyzed dataset against a list of so-called stop words, which are exactly the kind of words we were just discussing: high in frequency and low in meaning. The tidytext package provides a built-in stop_words data frame that we can use.

Let's have a look at it:

stop_words 

We can now filter our comments_tidy object based on the word column:

comments_tidy %>%
  filter(!word %in% stop_words$word) %>%
  count(word) %>%
  with(wordcloud(word, n))
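An equivalent and common tidytext idiom uses dplyr's anti_join(), which drops every row of comments_tidy whose word appears in stop_words. A sketch, assuming (as above) that both data frames share a word column:

```r
library(dplyr)
library(tidytext)
library(wordcloud)

comments_tidy %>%
  anti_join(stop_words, by = "word") %>%  # keep only rows whose word is NOT a stop word
  count(word) %>%
  with(wordcloud(word, n))
```

The two approaches give the same result here; anti_join() is often preferred in tidytext workflows because it stays in the data-frame idiom and makes the join column explicit.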

The wordcloud is now a more useful instrument for our analysis. What would you conclude by looking at it? Besides the quite obvious company, we find contact and person, which probably appear together as contact person. There are even more relevant words, such as delay, discount, difficult, unjustified, and revise.

Those words confirm that something bad is going on with these companies, even if they fail to provide the relevant context. Looking at the wordcloud, we can conclude that the topics of the discussions relate to the contact persons, discounts, and delays in payments.

How do we obtain more context? We are going to look at n-grams for that.
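As a preview, n-gram tokenization can be sketched with tidytext's unnest_tokens() function, passing token = "ngrams". This assumes the raw comments live in a data frame with a text column (an assumption, since that object is not shown in this section):

```r
library(dplyr)
library(tidytext)

# Tokenize raw comments into bigrams: each output row is one
# two-word sequence, preserving some of the surrounding context
comments %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
```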
