Analyzing PDF reports in a folder with the tm package

Text analytics is basically a way to perform quantitative analysis on qualitative information stored in text. In this recipe, we will create a corpus of documents from PDF files and perform descriptive analytics on them, looking for the most frequent terms.

This is a particularly useful recipe for professionals who work with PDF reports.

In this recipe, we will explore the full text of the Italian medieval masterpiece Divine Comedy by Dante Alighieri. You can find out more on Wikipedia at https://en.wikipedia.org/wiki/Divine_Comedy:

Dante Alighieri is shown holding a copy of the Divine Comedy, next to the entrance to Hell, the seven terraces of Mount Purgatory and the city of Florence, with the spheres of Heaven above, in Michelino's fresco.

Getting ready

In this recipe, we will use the pdftotext utility in order to read text from the PDF format.

You can download pdftotext from http://www.foolabs.com/xpdf/download.html. Depending on the operating system you are working on, you will have to perform different steps in order to properly install the package. Proper instructions can be found in the INSTALL file that comes with each package.
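If you want to verify from within R that the utility is reachable (this assumes you have placed the pdftotext binary on your system PATH), a minimal check is:

# Returns the full path to the pdftotext binary, or "" if R cannot find it
Sys.which("pdftotext")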

Once you are done with pdftotext, it is time to install the required packages:

install.packages(c("tm","ggplot2","wordcloud"))
library(tm)
library(ggplot2)
library(wordcloud)

How to do it...

  1. Define the directory where PDF reports are stored:
    directory <- c("pdf_files")
    
  2. Create a corpus object from your reports:
    corpus <- Corpus(DirSource(directory, encoding = "UTF-8"),
      readerControl = list(reader = readPDF(), language = "it"))
    
  3. Remove punctuation:
    corpus <- tm_map(corpus,removePunctuation)
    
  4. Remove numbers:
    corpus <- tm_map(corpus,removeNumbers)
    
  5. Change capital letters to lowercase:
    corpus <- tm_map(corpus, content_transformer(tolower))
    
  6. Remove stop words:
    corpus <- tm_map(corpus,removeWords,stopwords(kind = "it"))
    
  7. Put every document into plain text format:
    corpus <- tm_map(corpus,PlainTextDocument)
    
  8. Define a document term matrix:
    term_matrix <- DocumentTermMatrix(corpus)
    
  9. Remove infrequent terms (sparse terms):
    term_matrix <- removeSparseTerms(term_matrix,0.2)
    
  10. Find out the most frequent words:
    frequent_words <- colSums(as.matrix(term_matrix))
    frequent_words <- as.data.frame(frequent_words)
    frequent_words <- data.frame(row.names(frequent_words), frequent_words)
    colnames(frequent_words) <- c("term","frequency")
    frequent_words$term <- as.character(frequent_words$term)
    row.names(frequent_words) <- NULL
    
  11. Remove insignificant terms:
    to_be_removed <- c("mai","<e8>","<ab>","s<ec>","pi<f9>","<f2>",
      "<ab>cos<ec>","<e0>","s<e9>","perch<e9>",
      "gi<f9>","f<e9>","ch<e8>","cos<ec>","gi<e0>","tanto","ch<e9>","n<e9>")
    frequent_words <- frequent_words[!(frequent_words$term %in% to_be_removed),]
    
  12. Filter for only frequent terms:
    frequent_words <- frequent_words[frequent_words$frequency > 100,]
    
  13. Plot your frequent words:
    plot <- ggplot(frequent_words, aes(term, frequency)) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
    plot
    

    Let's take a look at the following graph:


How it works...

In step 1, we define the directory where the PDF reports are stored. This sets the path to the folder containing all the PDF reports to be read.

Be aware that, since we are performing this analysis within the RStudio project related to this book, the working directory is automatically set to the directory the project is run from. We therefore only need to specify the relative path of the folder, that is, the part of the path from the working directory to the required folder.
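Before building the corpus, it can be worth checking that R actually sees the folder and its content; a quick sanity check, assuming the pdf_files folder from step 1, is:

# Confirm the working directory and list the PDF reports R can see
getwd()
list.files("pdf_files", pattern = "\\.pdf$")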

In step 2, we create a corpus object from the reports. Corpora are the basic objects of text analytics: a corpus can be considered a collection of the documents you are going to analyze. We initialize a corpus object, adding to it all the documents within the directory specified in the previous step.

Arguments of the Corpus() function are:

  • Directory path
  • Reader to be used for document loading; this is where the previously installed pdftotext comes in handy, as sketched after this list
  • Document language
  • Character encoding (UTF-8 in our case)
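As a sketch of how these arguments fit together (assuming pdftotext is installed and on the system PATH, so that the xpdf engine of readPDF() can call it), the corpus can be built with an explicitly chosen reader and then inspected:

# Explicitly select the pdftotext-based engine of readPDF()
pdf_reader <- readPDF(engine = "xpdf")
corpus <- Corpus(DirSource(directory, encoding = "UTF-8"),
                 readerControl = list(reader = pdf_reader, language = "it"))

length(corpus)      # number of PDF reports that were loaded
inspect(corpus[1])  # metadata and text of the first document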

In steps 3 to 6, we prepare the corpus for analysis. This involves the following activities:

  • Removing punctuation, which gives no added value to our analysis but can modify counts and stats on the corpus content
  • Removing numbers, for a similar reason to punctuation
  • Transforming all capital letters to lowercase so that the same words with and without capital letters are not counted twice
  • Removing stop words, such as "not," "or," and "and"; the stop word list can also be customized by passing a custom-defined vector of words to the tm_map() function, as shown in the sketch below

After performing those activities, our corpus will be ready for our analysis.
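As a minimal sketch of the customization mentioned above (the two extra Italian words are purely illustrative), you can append your own terms to the built-in stop word list:

# Built-in Italian stop words plus a few custom, corpus-specific words
custom_stopwords <- c(stopwords(kind = "it"), "quindi", "ovvero")
corpus <- tm_map(corpus, removeWords, custom_stopwords)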

In step 8, we define a document term matrix. Document term matrices are matrices representing the frequency of terms that occur in a corpus: rows correspond to the documents in the collection, and columns represent terms.

We define a document term matrix by running the DocumentTermMatrix() function on our corpus object.
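A few quick checks on the resulting matrix help confirm it looks as expected; the 50-occurrence threshold below is just an illustrative value:

dim(term_matrix)                           # documents x terms
inspect(term_matrix[1, 1:10])              # first document, first ten terms
findFreqTerms(term_matrix, lowfreq = 50)   # terms occurring at least 50 times overall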

In step 9, we remove infrequent (sparse) terms. The second argument of removeSparseTerms() is the maximum sparsity allowed for a term, where a term's sparsity is the proportion of documents in the corpus that do not contain it. The threshold is a number between 0 and 1, in which:

  • Values close to 0 are the strictest, keeping only terms that appear in nearly every document of the corpus
  • Values close to 1 are the most permissive, removing only terms that appear in very few documents
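A quick way to see the effect of this threshold (the 0.2 and 0.8 values below are only for comparison) is to rebuild the full matrix and count how many terms survive each setting:

full_matrix <- DocumentTermMatrix(corpus)
dim(full_matrix)                          # all terms
dim(removeSparseTerms(full_matrix, 0.2))  # strict: keeps terms present in at least 80% of documents
dim(removeSparseTerms(full_matrix, 0.8))  # permissive: drops only terms missing from more than 80% of documents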

In step 10, we find out the most frequent words. This step computes the total frequency of each word over the whole corpus using the colSums() function, and then builds a data frame holding each term and its frequency.
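A compact way to peek at the same information without building the data frame (here limited to the ten most frequent terms) is:

# Top ten terms by overall frequency across the corpus
head(sort(colSums(as.matrix(term_matrix)), decreasing = TRUE), 10)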

In step 11, we remove meaningless terms. This step is specific to PDF documents, since it removes reading errors that would otherwise distort the final statistics. For instance, we remove <e8>, which is a misreading of the character è, which stands for "is" in Italian.

In step 12, we remove infrequent words, leveraging the previously computed term frequencies and keeping only terms with more than 100 occurrences.

In step 13, we plot our data using a basic ggplot2 bar chart. Refer to the Adding text to a ggplot2 plot at a custom location recipe in Chapter 3, Basic Visualization Techniques, which provides a good introduction to these plots.
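The wordcloud package loaded in the Getting ready section is not used in the steps above; as a complementary view of the same frequencies, a minimal sketch of a word cloud (min.freq mirrors the 100-occurrence threshold of step 12) could be:

# Word cloud of the most frequent terms
wordcloud(words = frequent_words$term,
          freq = frequent_words$frequency,
          min.freq = 100,
          random.order = FALSE)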
