Creating word clouds with the wordcloud package

Word clouds are a nice and useful way to show text composition at a glance.

In a word cloud, the words that make up a text are arranged in a cloud-like shape, and their size and color usually reflect how often each term occurs in the source text.

In this way, it is possible to understand quickly which words are more relevant to the given text. In this recipe, we will explore the Wikipedia page related to the R programming language.

Getting ready

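We first need to install the required packages and load them into the R environment. The tm package is included because wordcloud() relies on it to compute word frequencies when it is given raw text:

install.packages(c("wordcloud","RColorBrewer","rvest","tm"))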
library(wordcloud)
library(rvest)
library(RColorBrewer)

How to do it...

  1. Define your document URL and download the page into the R environment:
    url <- "https://en.wikipedia.org/wiki/R_(programming_language)"
    page <- read_html(url)
    page <- html_text(page,trim = TRUE)
    page <- gsub("\n","",page, fixed = FALSE)
    page <- gsub("\t","",page, fixed = FALSE)
    
  2. Print out your word cloud:
    wordcloud(page)
    
  3. Filter for most frequent terms:
    wordcloud(page,min.freq = 4)
    
  4. Change the color combination of words:
    palette <- brewer.pal(n = 9, "Paired")
    wordcloud(page,colors = palette)
    

    Let's take a look at the following image:

    [Image: the resulting word cloud]

How it works...

Our first step involves storing the Wikipedia page URL in a dedicated variable and downloading the HTML code of that page.

HTML reading is done through the read_html() function from the rvest package.

After this step, we have the full HTML code of the page, including HTML tags such as <h1> and <a href>.

In order to remove these tags and focus on the proper text, we just have to run html_text() on the created page object.
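
As a small illustration of what html_text() does, the following sketch parses an inline HTML fragment instead of a live page. The fragment and the variable names snippet and plain are made up for this example and are not part of the recipe:

```r
library(rvest)

# A made-up HTML fragment standing in for a downloaded page (assumption)
snippet <- read_html(
  "<html><body><h1>R</h1><p>A <a href='#'>language</a> for statistics</p></body></html>"
)

# html_text() keeps only the text content and drops the tags
plain <- html_text(snippet, trim = TRUE)
print(plain)
```

The printed string should contain the words from the fragment but none of the angle-bracket tags.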

We then remove \n and \t, since they are just escape sequences for newlines and tabs.
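
The cleaning step can be tried on a short string using only base R. The sample string below is made up for illustration. Note that this sketch replaces each escape sequence with a space rather than an empty string, which is a deliberate departure from the recipe so that the words on either side are not glued together:

```r
# A made-up string containing a newline and a tab (assumption)
raw_text <- "R is a language\nand environment\tfor statistical computing"

# "\n" matches a newline and "\t" matches a tab
no_newlines <- gsub("\n", " ", raw_text)
clean_text <- gsub("\t", " ", no_newlines)

print(clean_text)
# [1] "R is a language and environment for statistical computing"
```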

In step 2, we print out our word cloud. Creating our word cloud is as easy as running the wordcloud() function on the page object.

Be aware that some default argument values apply here:

  • The size of each word in the cloud is proportional to the frequency of the word within the text object. If the freq argument is not specified, the frequencies are computed automatically within the function.
  • The min.freq argument is set to 3 by default, meaning that words that appear fewer than three times will not show up in your word cloud.
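
To make the two defaults above more concrete, here is a minimal base R sketch that counts words by hand and applies the same threshold that min.freq = 3 implies. The sample sentence and variable names are made up, and the sketch is a simplified view: it does not reproduce any text cleaning that wordcloud() may perform internally.

```r
# A made-up sentence standing in for the page text (assumption)
sample_text <- "r is a language r is free r is a language for statistics"

# Split on whitespace and count how often each word occurs
words <- strsplit(sample_text, "\\s+")[[1]]
word_counts <- table(words)

# Keep only the words that appear at least three times
frequent <- word_counts[word_counts >= 3]
print(frequent)

# With the wordcloud package loaded, explicit frequencies could be passed as:
# wordcloud(words = names(word_counts), freq = as.vector(word_counts), min.freq = 3)
```

In this sample only "r" and "is" reach the threshold, so only those two words would be drawn.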

In step 3, we filter for the most frequent terms. As said earlier, the frequency filter is set through the min.freq parameter. Changing the parameter will consequently lead to changes in the number of words displayed.

In step 4, we change the color combination of the words. By leveraging RColorBrewer, we can easily define color palettes to use for our word cloud.

To look at the available palettes, just run the following command on your R console:

display.brewer.all()

This function will produce the following plot:

[Image: plot of all available RColorBrewer palettes]

Any label shown on the left-hand side of the plot can be substituted for "Paired" in palette <- brewer.pal(n = 9, "Paired").

That said, you should be aware that changing the n argument will change the number of colors retrieved from the given brewer palette for your custom palette.
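
The effect of the n argument can be checked directly. The following sketch requests different numbers of colors from the "Paired" palette and looks up the palette's upper limit in brewer.pal.info; the variable names are chosen only for this example:

```r
library(RColorBrewer)

# Ask the "Paired" palette for different numbers of colors
three_colors <- brewer.pal(n = 3, name = "Paired")
nine_colors <- brewer.pal(n = 9, name = "Paired")
print(length(three_colors))
print(length(nine_colors))

# brewer.pal.info lists the maximum number of colors for each palette
max_paired <- brewer.pal.info["Paired", "maxcolors"]
print(max_paired)
```

Each palette has its own maximum, so it is worth checking brewer.pal.info before asking for a large n.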
