Text analytics is a way to perform quantitative analysis on the qualitative information stored in text. In this recipe, we will create a corpus of documents from PDF files and perform descriptive analytics on them, looking for the most frequent terms.
This is a particularly useful recipe for professionals who work with PDF reports.
In this recipe, we will explore the full text of the Italian medieval masterpiece Divine Comedy by Dante Alighieri. You can find out more on Wikipedia at https://en.wikipedia.org/wiki/Divine_Comedy:
In Michelino's fresco, Dante Alighieri is shown holding a copy of the Divine Comedy, next to the entrance to Hell, the seven terraces of Mount Purgatory, and the city of Florence, with the spheres of Heaven above.
In this recipe, we will use the pdftotext utility in order to read text from the PDF format. You can download pdftotext from http://www.foolabs.com/xpdf/download.html. Depending on the operating system you are working on, you will have to perform different steps in order to properly install the package. Proper instructions can be found in the INSTALL file that comes with each package.
Once you are done with pdftotext, it is time to install the required packages:
install.packages(c("tm", "ggplot2", "wordcloud"))
library(tm)
library(ggplot2)
library(wordcloud)
directory <- c("pdf_files")
Then, we create a corpus object from your reports:
corpus <- Corpus(DirSource(directory),
                 readerControl = list(reader = readPDF(),
                                      language = "it",
                                      encoding = "UTF-8"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords(kind = "it"))
corpus <- tm_map(corpus, PlainTextDocument)
term_matrix <- DocumentTermMatrix(corpus)
term_matrix <- removeSparseTerms(term_matrix,0.2)
frequent_words <- colSums(as.matrix(term_matrix))
frequent_words <- as.data.frame(frequent_words)
frequent_words <- data.frame(row.names(frequent_words), frequent_words)
colnames(frequent_words) <- c("term", "frequency")
frequent_words$term <- as.character(frequent_words$term)
row.names(frequent_words) <- c()
to_be_removed <- c("mai", "<e8>", "<ab>", "s<ec>", "pi<f9>", "<f2>",
                   "<ab>cos<ec>", "<e0>", "s<e9>", "perch<e9>",
                   "gi<f9>", "f<e9>", "ch<e8>", "cos<ec>", "gi<e0>",
                   "tanto", "ch<e9>", "n<e9>")
indexes <- match(to_be_removed, frequent_words$term)
frequent_words <- frequent_words[-indexes, ]
frequent_words <- frequent_words[frequent_words$frequency > 100,]
plot <- ggplot(frequent_words, aes(term, frequency)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
plot
Let's take a look at the following graph:
In step 1, we set the path of the directory where all the PDF reports to be read are stored.
Be aware that, since we are performing this analysis within the RStudio project related to this book, the working directory is automatically set to the directory the project is executed from. We therefore only need to specify the relative path of the folder, that is, the part of the path from the working directory to the required folder.
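As a quick sketch of how that relative path resolves, base R's file.path() and getwd() make the mechanism visible (the folder name pdf_files is the one used in this recipe):

```r
# Minimal sketch: resolving the recipe's relative folder name against the
# working directory (which the RStudio project sets automatically)
directory <- c("pdf_files")
full_path <- file.path(getwd(), directory)  # absolute path to the PDF folder
full_path
# dir.exists(full_path) then tells you whether the folder is actually there
```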
In step 2, we create a corpus object from your reports. Corpora are the basic objects of text analytics; a corpus can be considered a collection of the documents you are going to analyze. We initialize a corpus object, adding to it all the documents within the directory specified in the previous step. The arguments of the Corpus() function are the source of the documents, DirSource(directory), and a readerControl list specifying the reader to use (readPDF(), which is where pdftotext comes in handy), the language of the documents, and their encoding.
In steps 3 to 6, we prepare the corpus for analysis. Preparing our corpus for analysis involves the following activities:
removing punctuation with removePunctuation
removing numbers with removeNumbers
converting all text to lowercase with tolower
removing Italian stopwords with removeWords and stopwords(kind = "it")
All of these transformations are applied through the tm_map() function. After performing those activities, our corpus will be ready for our analysis.
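To make those transformations concrete, here is a base-R sketch that applies equivalent cleaning steps to a single line of text (the sample line is illustrative; the recipe itself performs these steps on the whole corpus via tm_map()):

```r
# Base-R sketch of the cleaning steps that tm_map() applies in the recipe
line <- "Nel mezzo del cammin di nostra vita, 1304!"
line <- gsub("[[:punct:]]", "", line)  # like removePunctuation
line <- gsub("[[:digit:]]", "", line)  # like removeNumbers
line <- tolower(line)                  # like the tolower transformation
line <- trimws(gsub(" +", " ", line))  # collapse leftover whitespace
line
# [1] "nel mezzo del cammin di nostra vita"
```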
In step 8, we define a document term matrix by running the DocumentTermMatrix() function on our corpus object. Document term matrices are matrices representing the frequency of terms occurring in a corpus: rows correspond to documents in the collection, and columns represent terms.
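To show the structure, here is a toy document term matrix built in base R (the words are made up for illustration; the recipe builds the real matrix with DocumentTermMatrix()):

```r
# Toy document term matrix in base R: rows = documents, columns = terms
docs <- list(doc1 = c("selva", "oscura", "selva"),
             doc2 = c("selva", "stelle"))
terms <- sort(unique(unlist(docs)))
dtm <- t(sapply(docs, function(d) table(factor(d, levels = terms))))
dtm
#      oscura selva stelle
# doc1      1     2      0
# doc2      0     1      1
```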
In step 9, we remove overly sparse terms by calling removeSparseTerms(). Its second argument is the maximum allowed sparsity, expressed as a proportion, where a term's sparsity is the share of documents in the corpus that do not contain it:
a sparsity close to 0 corresponds to terms appearing in nearly every document
a sparsity close to 1 corresponds to terms appearing in very few documents
With a threshold of 0.2, only terms appearing in at least 80 percent of the documents are retained.
In step 10, we find the most frequent words. This step computes the total frequency of each word across the whole corpus using the colSums() function. It then creates a data frame composed of terms and their frequencies.
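A minimal sketch of this frequency computation on a toy matrix (the term names are illustrative):

```r
# Sketch of step 10 on a toy matrix: total frequency per term via colSums()
m <- matrix(c(1, 2, 0,
              0, 1, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("doc1", "doc2"),
                            c("oscura", "selva", "stelle")))
frequent_words <- data.frame(term = colnames(m),
                             frequency = colSums(m),
                             row.names = NULL)
frequent_words
#     term frequency
# 1 oscura         1
# 2  selva         3
# 3 stelle         1
```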
In step 11, we remove meaningless terms. This step is specifically related to PDF documents, since it removes character-reading errors so that they do not pollute the final statistics. For instance, we remove <e8>, which is a wrong reading of the character è, which stands for "is" in Italian.
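The removal itself relies on match() and negative indexing; here is a toy sketch with made-up terms and frequencies:

```r
# Toy sketch of removing misread terms via match() and negative indexing
frequent_words <- data.frame(term = c("<e8>", "selva", "s<ec>"),
                             frequency = c(120, 110, 105),
                             stringsAsFactors = FALSE)
to_be_removed <- c("<e8>", "s<ec>")
indexes <- match(to_be_removed, frequent_words$term)  # row positions to drop
frequent_words <- frequent_words[-indexes, ]
frequent_words$term
# [1] "selva"
```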
In step 12, we remove infrequent words, leveraging the previously computed term frequencies and setting a threshold at 100 repetitions.
In step 13, we plot our data using the ggplot2 package. Refer to the Adding text to a ggplot2 plot at a custom location recipe in Chapter 3, Basic Visualization Techniques, which provides a good introduction to these plots.