Here is how we go about analyzing documents using tf-idf:
- Extract the text of all 62 chapters of the book Pride and Prejudice, then count the occurrences of each word per chapter. The book contains approximately 1.22M words in total.
Pride_Prejudice_chapters <- austen_books_df %>%
  group_by(book) %>%
  filter(book == "Pride & Prejudice") %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text,
                                     regex("^chapter [\\divxlc]",
                                           ignore_case = TRUE)))) %>%
  ungroup() %>%
  unnest_tokens(word, text) %>%
  count(book, chapter, word, sort = TRUE) %>%
  ungroup()
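The pipeline above produces per-chapter word counts only, while the rank step that follows also divides by a total word count. A minimal sketch of how such a column could be derived (toy data and the column name `totalwords` are assumptions chosen to match the later code, not taken from the original):

```r
# Sketch (toy data, an assumption): attach the book-wide word total as
# `totalwords` to every row of the per-chapter counts.
library(dplyr)

counts <- tibble::tibble(
  chapter = c(1, 1, 2),
  word    = c("elizabeth", "darcy", "darcy"),
  n       = c(5, 3, 2)
)

counts <- counts %>% mutate(totalwords = sum(n))
# every row now carries totalwords = 10
```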
- Calculate the rank of each word so that the most frequently occurring words receive the lowest ranks, and visualize term frequency by rank, as shown in the following figure:
This figure shows that words with higher term-frequency (ratio) values have lower ranks
# totalwords is assumed to hold the book-wide word total
freq_vs_rank <- Pride_Prejudice_chapters %>%
  mutate(rank = row_number(),
         term_frequency = n / totalwords)

freq_vs_rank %>%
  ggplot(aes(rank, term_frequency)) +
  geom_line(size = 1.1, alpha = 0.8, show.legend = FALSE) +
  scale_x_log10() +
  scale_y_log10()
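The roughly straight line on the log-log plot is the signature of Zipf's law. One way to quantify it (a sketch with synthetic data, not part of the original analysis) is to fit a linear model on the log scales and inspect the slope:

```r
# Sketch (synthetic data, an assumption): under Zipf's law, term frequency
# is proportional to 1/rank, so the slope of log10(frequency) on
# log10(rank) is -1; real text typically lands near that value.
rank <- 1:1000
term_frequency <- 0.1 / rank

fit <- lm(log10(term_frequency) ~ log10(rank))
slope <- unname(coef(fit)[2])  # exactly -1 for this synthetic data
```

Running the same fit on `freq_vs_rank` would show how closely the book follows the ideal power law.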
- Calculate the tf-idf value for each word using the bind_tf_idf function:
Pride_Prejudice_chapters <- Pride_Prejudice_chapters %>%
bind_tf_idf(word, chapter, n)
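To see what bind_tf_idf computes, here is a sketch on toy data (the words and counts are invented for illustration): tf is the word's share of its document's words, and idf is the natural log of the number of documents divided by the number of documents containing the word.

```r
# Sketch (toy data, an assumption): verify bind_tf_idf against the formulas
#   tf  = n / sum(n) within each document (here, chapter)
#   idf = ln(number of documents / number of documents containing the word)
library(dplyr)
library(tidytext)

toy <- tibble::tibble(
  chapter = c(1, 1, 2, 2),
  word    = c("darcy", "the", "wickham", "the"),
  n       = c(4, 10, 3, 9)
)

toy_tfidf <- toy %>% bind_tf_idf(word, chapter, n)
# "the" occurs in both chapters, so idf = ln(2/2) = 0 and its tf-idf is 0;
# "darcy" occurs only in chapter 1, so idf = ln(2/1) and tf-idf = (4/14) * ln(2).
```

Words common to every document get a tf-idf of zero, which is why tf-idf surfaces chapter-specific vocabulary such as character names.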
- Extract and visualize the top 15 words by tf-idf value, as shown in the following figure:
tf-idf values of the top 15 words
Pride_Prejudice_chapters %>%
  select(-totalwords) %>%
  arrange(desc(tf_idf))

Pride_Prejudice_chapters %>%
  arrange(desc(tf_idf)) %>%
  mutate(word = factor(word, levels = rev(unique(word)))) %>%
  group_by(book) %>%
  top_n(15) %>%
  ungroup() %>%
  ggplot(aes(word, tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~book, ncol = 2, scales = "free") +
  coord_flip()