Extracting text

To extract the text from a page of a PDF file, the PyPDF2 module provides the extractText() method. Create a script called extract_text.py and write the following content in it:

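import PyPDF2

# Open the PDF in binary mode; the reader needs the file to stay open.
with open('test.pdf', 'rb') as pdf:
    read_pdf = PyPDF2.PdfFileReader(pdf)
    # Page indexes start at 0, so index 1 is the second page.
    pdf_page = read_pdf.getPage(1)
    pdf_content = pdf_page.extractText()
    print(pdf_content)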

Run the script as follows:

student@ubuntu:~/work$ python3 extract_text.py

You will get the following output:

3Pythoncommands
9
3.1Comments........................................
.9
3.2Numbersandotherdatatypes........................
......9
3.2.1The
type
function................................9
3.2.2Strings.......................................
10
3.2.3Listsandtuples................................
..10
3.2.4The
range
function................................11
3.2.5Booleanvalues.................................
.11
3.3Expressions.....................................
...11
3.4Operators.......................................

In the preceding example, we created a file reader object. The reader object has a method called getPage(), which takes a page index as an argument and returns the corresponding page object. The index starts from 0, so getPage(1) returns the second page of the document. Next, we called the extractText() method on that page object to extract its text. As the output shows, the extracted text may lose the spacing between words.
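Because getPage() counts from 0 while people usually count pages from 1, it is easy to be off by one. The following is a minimal sketch of one way to handle this, not part of PyPDF2 itself: the helper names extract_pages() and extract_from_file() are hypothetical, and the code assumes the same legacy PyPDF2 interface used in this section (getNumPages(), getPage(), and extractText()).

```python
def extract_pages(reader, page_numbers):
    """Return a dict mapping 1-based page numbers to their extracted text.

    `reader` is assumed to expose the legacy PyPDF2 reader methods
    used in this section: getNumPages() and getPage(index).
    """
    total = reader.getNumPages()
    result = {}
    for number in page_numbers:
        if not 1 <= number <= total:
            raise IndexError(f"page {number} is out of range 1..{total}")
        # getPage() is zero-based, so page 1 lives at index 0.
        page = reader.getPage(number - 1)
        result[number] = page.extractText()
    return result


def extract_from_file(path, page_numbers):
    """Open a PDF file and extract text from the given 1-based pages."""
    # Imported here so extract_pages() can be used without PyPDF2 installed.
    import PyPDF2

    with open(path, 'rb') as pdf:
        reader = PyPDF2.PdfFileReader(pdf)
        return extract_pages(reader, page_numbers)
```

With this sketch, extract_from_file('test.pdf', [2]) should return the text of the same page that getPage(1) selects in extract_text.py. Checking the range first means an invalid page number fails with a clear message rather than an error raised deep inside the library.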
