Extracting data from a PDF

The ubiquity of PDF files is due to the ability of almost every PC, Mac, and smart device to open and process this format. Electronic documents are often exchanged as PDF because they cannot be easily altered and are, by default, read-only.

Many organizations use PDF files to distribute reports, bank statements, and invoices. Being able to read such documents and extract the information they contain is an invaluable tool in the belt of a Groovy programmer.

This recipe focuses on mining information from a PDF file.

Getting ready

As with ZIP files (see the Reading data from a ZIP file recipe), Groovy doesn't have any class to deal with PDF files. Java doesn't offer any built-in feature to read or write PDFs either. Therefore, we have to resort to a third-party library. A Google search for Java read PDF yields numerous results with links to various libraries.

In this recipe, we will use iText, one of the most popular PDF libraries in the Java ecosystem. iText is a very powerful library for generating PDF files, but it also has a very simple API for mining the text inside a PDF file.

For demonstration purposes, we are going to use a PDF version of Chapter 1, Getting Started with Groovy, of this book. A copy of the file is included in the code distribution as groovy2cookbook_chapter1.pdf.

How to do it...

The Groovy code that follows shows you how to open a PDF file and print the text of each of its pages to the console.

First of all, we need to @Grab the iText library and declare all imported classes that we are going to make use of:
```
@Grab('com.itextpdf:itextpdf:5.3.2')
import com.itextpdf.text.pdf.parser.*
import com.itextpdf.text.pdf.*
```

After that, we can construct objects that help to achieve our final target:

```
def pdf = new PdfReader('groovy2cookbook_chapter1.pdf')
def maxPages = pdf.numberOfPages + 1
def parser = new PdfTextExtractor()
```

And now, all that is left is to iterate through all the pages and extract the text:

```
(1..<maxPages).each { pageNumber ->
  println parser.getTextFromPage(pdf, pageNumber)
}
```

Output should be as follows:

```
01
Getting started with Groovy
In this chapter, we will cover:
? Installing Groovy on Windows
? Installing Groovy on Linux and OSX
...
```

How it works...

The previous script does several things. First, it uses Grape (see the Simplifying dependency management with Grape recipe in Chapter 2, Using Groovy Ecosystem) to fetch iText from a Maven repository through the @Grab annotation; the version is pinned to 5.3.2. Then an instance of the com.itextpdf.text.pdf.PdfReader class is created for reading the PDF document. PdfReader can be constructed with different arguments, but we chose a String holding the file name for simplicity. After instantiating PdfReader, we get the number of pages of the PDF file through the numberOfPages property, which is Groovy shorthand for calling the getNumberOfPages method of PdfReader. We add 1 to that value so that it can serve as the exclusive upper bound of the 1..<maxPages range.

Finally, we loop through all the pages and, for each page, we call getTextFromPage from the com.itextpdf.text.pdf.parser.PdfTextExtractor class. The method returns the text found on the page, which is then printed to the console.
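The steps described above can be tried without the sample chapter file. The following is a minimal sketch, assuming iText 5.3.2 can be resolved by Grape: it builds a tiny two-page PDF in memory (the page contents are made up for illustration), reads it back through the byte-array constructor of PdfReader, and calls getTextFromPage as a static method. The variable names are ours, not part of the recipe.

```groovy
@Grab('com.itextpdf:itextpdf:5.3.2')
import com.itextpdf.text.Document
import com.itextpdf.text.Paragraph
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.PdfWriter
import com.itextpdf.text.pdf.parser.PdfTextExtractor

// Build a small two-page PDF in memory so the sketch needs no input file.
def buffer = new ByteArrayOutputStream()
def document = new Document()
PdfWriter.getInstance(document, buffer)
document.open()
document.add(new Paragraph('First page'))
document.newPage()
document.add(new Paragraph('Second page'))
document.close()

// PdfReader accepts a byte array as well as a file name.
def pdf = new PdfReader(buffer.toByteArray())
def pages = []
try {
  // pdf.numberOfPages is Groovy property syntax for pdf.getNumberOfPages().
  // Page numbers start at 1.
  for (int page = 1; page <= pdf.numberOfPages; page++) {
    // getTextFromPage is static, so no PdfTextExtractor instance is required.
    pages << PdfTextExtractor.getTextFromPage(pdf, page)
  }
} finally {
  // Release the resources held by the reader.
  pdf.close()
}

pages.each { println it }
```

Closing the reader in a finally block is a design choice of this sketch; the recipe's short script simply lets the process exit.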

There's more...

Extracting text from a PDF file is relatively easy in Groovy (and Java), but interpreting the structure of a PDF file can be a very daunting task as PDF files have a layout-oriented structure rather than a content-oriented one. If you have to cope with PDF documents that have a nonstandard structure (for example, columns or tables), you may want to write your own strategy for text extraction. The getTextFromPage method of the PdfTextExtractor class accepts instances of the TextExtractionStrategy interface.
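As a rough illustration of what writing your own strategy involves, here is a minimal sketch, assuming iText 5.3.2. The PositionedTextStrategy class is hypothetical (it is not part of iText): it records every text chunk together with the coordinates where its baseline starts, which is the kind of information you would need to reassemble columns or table cells. The sample PDF content is invented for the example.

```groovy
@Grab('com.itextpdf:itextpdf:5.3.2')
import com.itextpdf.text.Document
import com.itextpdf.text.Paragraph
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.PdfWriter
import com.itextpdf.text.pdf.parser.ImageRenderInfo
import com.itextpdf.text.pdf.parser.PdfTextExtractor
import com.itextpdf.text.pdf.parser.TextExtractionStrategy
import com.itextpdf.text.pdf.parser.TextRenderInfo
// Aliased to avoid any confusion with java.util.Vector.
import com.itextpdf.text.pdf.parser.Vector as PdfVector

// Hypothetical strategy: keeps each chunk of text with its position on the page.
class PositionedTextStrategy implements TextExtractionStrategy {
  final List chunks = []

  void beginTextBlock() {}                    // not needed for this sketch
  void endTextBlock() {}                      // not needed for this sketch
  void renderImage(ImageRenderInfo info) {}   // images are ignored

  void renderText(TextRenderInfo info) {
    def start = info.baseline.startPoint
    chunks << [x: start.get(PdfVector.I1), y: start.get(PdfVector.I2), text: info.text]
  }

  // Called by PdfTextExtractor to obtain the final text.
  String getResultantText() {
    chunks.collect { it.text }.join('\n')
  }
}

// Build a small PDF in memory to run the strategy against.
def buffer = new ByteArrayOutputStream()
def document = new Document()
PdfWriter.getInstance(document, buffer)
document.open()
document.add(new Paragraph('Name Price'))
document.add(new Paragraph('Apple 10'))
document.close()

def pdf = new PdfReader(buffer.toByteArray())
def strategy = new PositionedTextStrategy()
def text = PdfTextExtractor.getTextFromPage(pdf, 1, strategy)
pdf.close()

strategy.chunks.each { println "${it.x}, ${it.y}: ${it.text}" }
```

A real strategy would go on to group the recorded chunks by their y coordinate (rows) and x coordinate (columns); that logic depends heavily on the layout of the documents you process.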

iText ships with some implementations of the interface, such as SimpleTextExtractionStrategy, which stores all the snippets in the order they occur in the stream and is smart enough to detect which text portions should be combined into a single word or separated with a space character.

There is also a LocationTextExtractionStrategy class, which orders the extracted text by its position on the page. Combined with FilteredTextRenderListener and RegionTextRenderFilter, it allows you to extract text only from a certain area of a PDF file. The next script is a modified version of the previous one and shows you how to do this. We define a rectangular area from which the text is extracted; in this case, it's the area of the chapter's title. The part of the code that does the text extraction changes, as we now pass the strategy to the getTextFromPage method and only process the first page:

```
@Grab('com.itextpdf:itextpdf:5.3.2')
import com.itextpdf.text.pdf.parser.*
import com.itextpdf.text.pdf.*
import com.itextpdf.text.Rectangle

def rect = new Rectangle(0, 550, 1000, 800)
def pdf = new PdfReader('groovy2cookbook_chapter1.pdf')
def parser = new PdfTextExtractor()
def strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(),
  new RegionTextRenderFilter(rect))
println parser.getTextFromPage(pdf, 1, strategy)
```

The output should be as follows:

```
01
Getting started with Groovy
```
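The rectangle in the script uses hard-coded coordinates that fit this particular document. As a variation, the region can be derived from the page size reported by PdfReader. The following is a minimal sketch, assuming iText 5.3.2; the in-memory sample PDF and the choice of "top quarter of the page" are ours, for illustration only.

```groovy
@Grab('com.itextpdf:itextpdf:5.3.2')
import com.itextpdf.text.Document
import com.itextpdf.text.Paragraph
import com.itextpdf.text.Rectangle
import com.itextpdf.text.pdf.PdfReader
import com.itextpdf.text.pdf.PdfWriter
import com.itextpdf.text.pdf.parser.FilteredTextRenderListener
import com.itextpdf.text.pdf.parser.LocationTextExtractionStrategy
import com.itextpdf.text.pdf.parser.PdfTextExtractor
import com.itextpdf.text.pdf.parser.RegionTextRenderFilter

// Build a one-page PDF in memory: a heading followed by 30 body lines.
def buffer = new ByteArrayOutputStream()
def document = new Document()
PdfWriter.getInstance(document, buffer)
document.open()
document.add(new Paragraph('Chapter heading'))
30.times { document.add(new Paragraph("Body line ${it + 1}")) }
document.close()

def pdf = new PdfReader(buffer.toByteArray())

// PDF coordinates start at the bottom-left corner, so the top of the page
// has the largest y value.
def page = pdf.getPageSize(1)
float lowerEdge = page.top - page.height / 4
def topQuarter = new Rectangle(page.left, lowerEdge, page.right, page.top)

// Only text inside the rectangle is passed on to the wrapped strategy.
def strategy = new FilteredTextRenderListener(
  new LocationTextExtractionStrategy(), new RegionTextRenderFilter(topQuarter))
def topText = PdfTextExtractor.getTextFromPage(pdf, 1, strategy)
pdf.close()

println topText
```

With this sample document, the heading falls inside the region while the lines near the bottom of the page do not; the first few body lines may also be included, since a quarter of the page is a generous region.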

Another issue that you may face when parsing PDF files is dealing with non-English text. iText does a good job of extracting the text for you, but to get a proper result you need to know which encoding was used in the PDF file for the text you want to extract. For example, to save Russian text encoded with the KOI8-R charset, you can use the following snippet, which reuses the pdf, parser, and maxPages variables from the first script:

```
new File('output.txt').withWriter('KOI8-R') { writer ->
  (1..<maxPages).each {
    writer << parser.getTextFromPage(pdf, it)
  }
}
```

This code saves the extracted text into the output.txt file using the specified encoding.
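To see why the charset passed to withWriter matters, consider the following minimal sketch. It does not use iText at all: the extracted variable is a hypothetical stand-in for text returned by getTextFromPage (the Russian word for "hello", written with Unicode escapes so that the script does not depend on the encoding of the source file), and the temporary file name is arbitrary.

```groovy
// Hypothetical stand-in for text returned by getTextFromPage.
def extracted = '\u041f\u0440\u0438\u0432\u0435\u0442'

def output = File.createTempFile('pdf-text', '.txt')
output.deleteOnExit()

// Write the text using the KOI8-R charset, as in the recipe.
output.withWriter('KOI8-R') { writer -> writer << extracted }

// Reading with the same charset restores the text;
// reading with a different one does not.
def sameCharset = output.getText('KOI8-R')
def otherCharset = output.getText('UTF-8')

// KOI8-R stores each Cyrillic letter in a single byte.
def byteCount = output.length()

println "Round trip with KOI8-R matches: ${sameCharset == extracted}"
println "Read as UTF-8 matches: ${otherCharset == extracted}"
println "Bytes written: ${byteCount}"
```

Whoever reads output.txt later has to use the same charset, so it is worth recording the choice alongside the file or settling on a single encoding for all extracted text.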
