I'm trying to extract the text included in this PDF file using Python.
I'm using the PyPDF2 module, and have the following script:
imp
PyPDF2 does work, but results may vary. I am seeing quite inconsistent output from its text extraction.
reader = PyPDF2.pdf.PdfFileReader(self._path)
eachPageText = []
for i in range(0, reader.getNumPages()):
    pageText = reader.getPage(i).extractText()
    print(pageText)
    eachPageText.append(pageText)
PyPDF2 in some cases ignores whitespace and makes the resulting text a mess. I use PyMuPDF instead and I'm really satisfied with it; you can use this link for more info.
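For anyone who wants to try the PyMuPDF route, here is a minimal sketch. It assumes PyMuPDF is installed (pip install PyMuPDF) and a reasonably recent version where the page method is named get_text(). The function names pages_to_text and extract_text_pymupdf are my own, not part of the library.

```python
def pages_to_text(pages):
    """Collect the text of each page-like object that exposes get_text()."""
    return [page.get_text() for page in pages]


def extract_text_pymupdf(pdf_path):
    """Return a list with one string per page of the PDF at pdf_path."""
    # Imported here so the helper above can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    # Iterating over an open document yields its pages in order.
    with fitz.open(pdf_path) as doc:
        return pages_to_text(doc)
```

Something like print("\n".join(extract_text_pymupdf("filename.pdf"))) should then print the whole document, although how well spacing survives still depends on the PDF itself.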
pdftotext is the best and simplest one! It also preserves the structure of the document.
I tried PyPDF2, PDFMiner and a few others but none of them gave a satisfactory result.
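As a rough sketch of how pdftotext could be driven from Python: I'm assuming here that this means the Poppler command-line tool (there is also a Python package of the same name, which this does not cover) and that the binary is on your PATH. The -layout flag asks it to keep the physical layout, and "-" as the output file sends the text to stdout. The function names are hypothetical.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for the pdftotext command-line tool."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # maintain the original physical layout
    cmd += [pdf_path, "-"]     # "-" writes the extracted text to stdout
    return cmd


def extract_text_pdftotext(pdf_path, keep_layout=True):
    """Run pdftotext on pdf_path and return the extracted text as a str."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # Assumes pdftotext's default UTF-8 output encoding.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text_pdftotext("filename.pdf") should return the whole document as one string, with pages separated by form feed characters.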
Here is the simplest code for extracting text:
# importing required modules
import PyPDF2
# creating a pdf file object
pdfFileObj = open('filename.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# printing number of pages in pdf file
print(pdfReader.numPages)
# creating a page object (page indices start at 0, so 5 is the sixth page)
pageObj = pdfReader.getPage(5)
# extracting text from page
print(pageObj.extractText())
# closing the pdf file object
pdfFileObj.close()
Use textract. It supports many types of files, including PDFs.
import textract
# process() returns the extracted text as bytes
text = textract.process("path/to/file.extension")
The code below is a solution to the question in Python 3. Before running it, make sure the PyPDF2 library is installed in your environment. If it is not, open a command prompt and run the following command:
pip3 install PyPDF2
Solution Code:
import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())
pdfFileObject.close()
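One caveat with the PyPDF2 snippets in this thread: as far as I know, PyPDF2 3.0 removed the old camelCase names (PdfFileReader, getPage, extractText, numPages) in favour of PdfReader, reader.pages and extract_text(). A minimal sketch of the same loop with the newer names follows. The function names extract_all_text and read_pdf_text are my own, and 'sample.pdf' is just the placeholder file name from the answer above.

```python
def extract_all_text(reader):
    """Return one string per page for a reader object exposing .pages."""
    # Fall back to "" in case a page yields no text.
    return [page.extract_text() or "" for page in reader.pages]


def read_pdf_text(pdf_path):
    """Open pdf_path with PyPDF2 and return the text of every page."""
    # Imported here so extract_all_text can be used on its own.
    from PyPDF2 import PdfReader  # newer PyPDF2 naming

    # The with-block closes the file even if extraction fails.
    with open(pdf_path, "rb") as pdf_file:
        return extract_all_text(PdfReader(pdf_file))
```

Usage would be along the lines of: for text in read_pdf_text('sample.pdf'): print(text)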