Here is how we go about analyzing documents using tf-idf:
- Extract the text of all 62 chapters of the book Pride and Prejudice, then count the occurrences of each word per chapter. The book contains approximately 1.22M words in total.
Pride_Prejudice_chapters <- austen_books_df %>%
  group_by(book) %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  count(book, chapter, word, sort = TRUE) %>%
  ungroup()
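The pipeline above produces per-chapter word counts only, while the rank step that follows also divides by a total word count. A minimal sketch of how such a column could be derived (toy data and the column name `totalwords` are assumptions chosen to match the later code, not taken from the original):

```r
# Sketch (toy data, an assumption): attach the book-wide word total as
# `totalwords` to every row of the per-chapter counts.
library(dplyr)

counts <- tibble::tibble(
  chapter = c(1, 1, 2),
  word    = c("elizabeth", "darcy", "darcy"),
  n       = c(5, 3, 2)
)

counts <- counts %>% mutate(totalwords = sum(n))
# every row now carries totalwords = 10
```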
- Calculate the rank of each word so that the most frequently occurring words receive the lowest ranks, and visualize term frequency by rank, as shown in the following figure:
This figure shows that words with higher term-frequency (ratio) values have lower ranks
# totalwords is assumed to hold the book-wide word total
freq_vs_rank <- Pride_Prejudice_chapters %>%
  mutate(rank = row_number(),
         term_frequency = n / totalwords)

freq_vs_rank %>%
  ggplot(aes(rank, term_frequency)) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()
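The roughly straight line on the log-log plot is the signature of Zipf's law. One way to quantify it (a sketch with synthetic data, not part of the original analysis) is to fit a linear model on the log scales and inspect the slope:

```r
# Sketch (synthetic data, an assumption): under Zipf's law, term frequency
# is proportional to 1/rank, so the slope of log10(frequency) on
# log10(rank) is -1; real text typically lands near that value.
rank <- 1:1000
term_frequency <- 0.1 / rank

fit <- lm(log10(term_frequency) ~ log10(rank))
slope <- unname(coef(fit)[2])  # exactly -1 for this synthetic data
```

Running the same fit on `freq_vs_rank` would show how closely the book follows the ideal power law.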
- Calculate the tf-idf value for each word using the bind_tf_idf function:
Pride_Prejudice_chapters <- Pride_Prejudice_chapters %>%
bind_tf_idf(word, chapter, n)
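To see what bind_tf_idf computes, here is a sketch on toy data (the words and counts are invented for illustration): tf is the word's share of its document's words, and idf is the natural log of the number of documents divided by the number of documents containing the word.

```r
# Sketch (toy data, an assumption): verify bind_tf_idf against the formulas
#   tf  = n / sum(n) within each document (here, chapter)
#   idf = ln(number of documents / number of documents containing the word)
library(dplyr)
library(tidytext)

toy <- tibble::tibble(
  chapter = c(1, 1, 2, 2),
  word    = c("darcy", "the", "wickham", "the"),
  n       = c(4, 10, 3, 9)
)

toy_tfidf <- toy %>% bind_tf_idf(word, chapter, n)
# "the" occurs in both chapters, so idf = ln(2/2) = 0 and its tf-idf is 0;
# "darcy" occurs only in chapter 1, so idf = ln(2/1) and tf-idf = (4/14) * ln(2).
```

Words common to every document get a tf-idf of zero, which is why tf-idf surfaces chapter-specific vocabulary such as character names.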
- Extract and visualize the top 15 words by tf-idf value, as shown in the following figure:
tf-idf values of the top 15 words
Pride_Prejudice_chapters %>%
  select(-totalwords) %>%
  arrange(desc(tf_idf))

Pride_Prejudice_chapters %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(book) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()