Python text extraction does not work on some pdfs

梦想与她 提交于 2019-12-05 07:39:25

问题


I am trying to read a pdf through url. I followed many stackoverflow suggestions and used PyPdf2 FileReader to extract text from the pdf. My code looks like this :

url = "http://kat.kar.nic.in:8080/uploadedFiles/C_13052015_ch1_l1.pdf"
#url = "http://kat.kar.nic.in:8080/uploadedFiles/C_06052015_ch1_l1.pdf"
f = urlopen(Request(url)).read()
fileInput = StringIO(f)
pdf = PyPDF2.PdfFileReader(fileInput)

print pdf.getNumPages()
print pdf.getDocumentInfo()
print pdf.getPage(1).extractText()

I am able to successfully extract text for first link. But if I use the same program for the second pdf. I do not get any text. The page numbers and document info seem to show up.

I tried extracting text from Pdfminer through terminal and was able to extract text from the second pdf.

Any idea what is wrong with the pdf or is there a drawback with the libraries I am using ?


回答1:


If you read the comments in the pyPDF documentation you'll see that it's written right there that this functionality will not work well for some PDF files; in other words, you're looking at a restriction of the library.

Looking at the two PDF files, I can't see anything wrong with the files themselves. But...

The first file contains fully embedded fonts The second file contains subsetted fonts

This means that the second file is more difficult to extract text from and the library probably doesn't support that properly. Just for reference I did a text extraction with callas pdfToolbox (caution, I'm affiliated with this tool) which uses the Acrobat text extraction and the text is properly extracted for both files (confirming that it's not the PDF files that are the problem).



来源:https://stackoverflow.com/questions/30272269/python-text-extraction-does-not-work-on-some-pdfs

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!