How to do it...

Here is how we go about analyzing documents using tf-idf:

  1. Extract the text of all 61 chapters of the book Pride and Prejudice (the front matter is detected as chapter 0), then count the chapter-wise occurrences of each word. The book contains roughly 122,000 words in total:
library(dplyr)
library(stringr)
library(ggplot2)
library(tidytext)
library(janeaustenr)

# The Austen corpus from the janeaustenr package
austen_books_df <- austen_books()

Pride_Prejudice_chapters <- austen_books_df %>%
  filter(book == "Pride & Prejudice") %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  count(book, chapter, word, sort = TRUE) %>%
  ungroup()
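A quick sanity check (not part of the recipe itself) confirms the chapter and token counts produced by the pipeline above:

# Sanity check: chapters 1-61 plus the front matter (chapter 0)
n_distinct(Pride_Prejudice_chapters$chapter)   # 62 document groups
sum(Pride_Prejudice_chapters$n)                # roughly 122,000 word tokens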
  2. Calculate the rank of each word so that the most frequently occurring words get the lowest ranks, and visualize term frequency by rank, as shown in the following figure:
Figure: term frequency versus rank on log-log scales; frequently used words (lower rank) have higher term frequency
# Add the book-level token total as a column (it is used again in step 4),
# then rank the words; the data is already sorted by count, so row_number()
# assigns rank 1 to the most frequent word
Pride_Prejudice_chapters <- Pride_Prejudice_chapters %>%
  mutate(totalwords = sum(n))

freq_vs_rank <- Pride_Prejudice_chapters %>%
  mutate(rank = row_number(),
         term_frequency = n / totalwords)
freq_vs_rank %>%  
  ggplot(aes(rank, term_frequency)) +  
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +  
  scale_x_log10() + 
  scale_y_log10()
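The near-straight line on the log-log plot reflects Zipf's law: term frequency is roughly proportional to the inverse of rank. As an optional aside (the rank cutoffs here are illustrative choices, not part of the recipe), you can estimate the exponent with a linear fit over the middle ranks, where the power-law behavior is cleanest:

# Fit log10(term_frequency) ~ log10(rank) over the middle of the rank range
rank_subset <- freq_vs_rank %>%
  filter(rank > 10, rank < 500)
lm(log10(term_frequency) ~ log10(rank), data = rank_subset)
# A slope near -1 indicates the classic Zipf relationship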
  3. Calculate the tf-idf value for each word using the bind_tf_idf() function, treating each chapter as a document:
Pride_Prejudice_chapters <- Pride_Prejudice_chapters %>%
  bind_tf_idf(word, chapter, n)
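To make the result less of a black box, here is a minimal re-derivation of what bind_tf_idf() computes, assuming each chapter is one document (the _manual names are ours, not from the recipe): tf is the word's share of its chapter's tokens, and idf is the natural log of the number of chapters divided by the number of chapters containing the word.

# Manual tf-idf, mirroring bind_tf_idf(word, chapter, n)
total_chapters <- n_distinct(Pride_Prejudice_chapters$chapter)

manual_tf_idf <- Pride_Prejudice_chapters %>%
  group_by(chapter) %>%
  mutate(tf_manual = n / sum(n)) %>%                 # share of chapter tokens
  ungroup() %>%
  group_by(word) %>%
  mutate(idf_manual = log(total_chapters / n_distinct(chapter))) %>%
  ungroup() %>%
  mutate(tf_idf_manual = tf_manual * idf_manual)     # should match tf_idf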
  4. Extract and visualize the 15 words with the highest tf-idf values, as shown in the following figure:
Figure: tf-idf values of the top 15 words

# Inspect the words with the highest tf-idf values
Pride_Prejudice_chapters %>%
  select(-totalwords) %>%
  arrange(desc(tf_idf))

# Plot the top 15 words by tf-idf
Pride_Prejudice_chapters %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(book) %>%
  top_n(15, tf_idf) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()
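Because only one book is present, the facet_wrap(~book) call above produces a single panel. A variant you may find more informative (our suggestion, not part of the original recipe) facets a few individual chapters instead, showing which words characterize each one:

# Illustrative: top 10 tf-idf words for the first four chapters
Pride_Prejudice_chapters %>%
  filter(chapter %in% 1:4) %>%
  group_by(chapter) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)) %>%
  ggplot(aes(word, tf_idf, fill = factor(chapter))) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~chapter, ncol = 2, scales = "free") +
  coord_flip()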