pyPdf unable to extract text from some pages in my PDF

伪装坚强ぢ 2021-01-05 13:07

I'm trying to use pyPdf to extract and print pages from a multipage PDF. The problem is that text is not extracted from some pages. I've put an example file here:

http://w

6 answers
  • 2021-01-05 13:23

    I'm starting to think I should adopt a messy two-part solution. The PDF has two sections: pp. 1-82, which have text page labels (pdftotext can extract these), and pp. 83-end, which have no page labels but which pyPdf can extract, and for which it explicitly knows the pages.

    I think I need to combine the two. Clunky, but I don't see any way round it. Sadly I'm having to do this on a Windows machine.
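A rough sketch of that two-part approach in Python 3, assuming pdftotext is on the PATH. The `-f`, `-l` and `-enc` options select the first page, last page and output encoding. `extract_rest` is a hypothetical callable standing in for the existing pyPdf code, and the split page (82 here) comes from the description above:

```python
import subprocess

def pdftotext_cmd(pdf_path, first, last, exe="pdftotext"):
    """Build a pdftotext command for pages first..last (1-based), writing to stdout."""
    return [exe, "-f", str(first), "-l", str(last), "-enc", "UTF-8", pdf_path, "-"]

def extract_combined(pdf_path, split_page, extract_rest):
    """Use pdftotext for pages 1..split_page and extract_rest for the remainder.

    extract_rest is a hypothetical callable wrapping the pyPdf code: it takes
    the PDF path and the first page it should handle, and returns text.
    """
    head = subprocess.check_output(pdftotext_cmd(pdf_path, 1, split_page))
    return head.decode("utf-8", errors="replace") + extract_rest(pdf_path, split_page + 1)
```

Calling `extract_combined("forms.pdf", 82, my_pypdf_extractor)` would then stitch the two halves together, where `my_pypdf_extractor` is whatever function already handles pp. 83-end.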

  • 2021-01-05 13:35

    Note that extractText() still has problems extracting the text properly. From the documentation for extractText():

    This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.

    Since it is the text you want, you can use the Linux command pdftotext.

    To invoke that using Python, you can do this:

    >>> import subprocess
    >>> subprocess.call(['pdftotext', 'forms.pdf', 'output'])
    

    The text is extracted from forms.pdf and saved to output.

    This works in the case of your PDF file and extracts the text you want.
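If the pages are needed separately rather than as one file, a possible variation (Python 3) is to pass `-` as the output name so the text goes to stdout, then split on the form feed character that pdftotext normally writes at the end of each page. The function names here are illustrative, not part of any library:

```python
import subprocess

def split_pages(text):
    """Split pdftotext output into per-page strings on the form feed separator."""
    pages = text.split("\f")
    # The last page is normally terminated by a form feed too, which leaves an
    # empty trailing element after the split; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages

def pdf_pages(pdf_path):
    """Run pdftotext (assumed to be on the PATH) and return a list of page texts."""
    raw = subprocess.check_output(["pdftotext", "-enc", "UTF-8", pdf_path, "-"])
    return split_pages(raw.decode("utf-8", errors="replace"))
```

`pdf_pages('forms.pdf')[0]` would then be the text of the first page.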

  • 2021-01-05 13:35

    I sometimes find it useful to convert the file to PostScript (try both pdf2ps and pdftops, since they can give different results) and then back to PDF (ps2pdf). Then try your original script again.
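A minimal sketch of that round trip driven from Python 3, assuming the converters are installed and on the PATH. The file names and the helper names are placeholders:

```python
import subprocess

def roundtrip_cmds(src_pdf, tmp_ps, dst_pdf, to_ps="pdftops"):
    """Return the two commands for a PDF -> PostScript -> PDF round trip."""
    return [[to_ps, src_pdf, tmp_ps], ["ps2pdf", tmp_ps, dst_pdf]]

def roundtrip(src_pdf, tmp_ps, dst_pdf, to_ps="pdftops"):
    """Run both conversions in order; raises CalledProcessError if one fails."""
    for cmd in roundtrip_cmds(src_pdf, tmp_ps, dst_pdf, to_ps):
        subprocess.check_call(cmd)
```

Passing `to_ps="pdf2ps"` switches the first step to the other converter, which makes it easy to compare the two results.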

  • 2021-01-05 13:36

    I had a similar problem with some PDFs, and on Windows this works very well for me:

    1.- Download the Xpdf tools for Windows

    2.- Copy pdftotext.exe from xpdf-tools-win-4.00\bin32 to C:\Windows\System32 and also to C:\Windows\SysWOW64

    3.- Use subprocess to run the command from the console:

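    import subprocess

    # filePath is the path to the input PDF; '-' makes pdftotext write to stdout.
    try:
        # An argument list (no shell=True) keeps paths containing spaces intact.
        extInfo = subprocess.check_output(['pdftotext.exe', filePath, '-'], stderr=subprocess.STDOUT).strip()
    except (subprocess.CalledProcessError, OSError) as e:
        print(e)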
    
  • 2021-01-05 13:43

    You could also try the pdfminer library (also written in Python) and see if it's better at extracting the text. For splitting, however, you will have to stick with pyPdf, as pdfminer doesn't support that.
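One way to compare the two extractors, sketched in Python 3: extract each page separately and list the pages that come back blank. This assumes the maintained pdfminer.six fork, whose `pdfminer.high_level.extract_text` accepts zero-based `page_numbers`; the helper names and the `page_count` argument (which could come from pyPdf's `getNumPages()`) are illustrative:

```python
def empty_pages(page_texts):
    """Return 0-based indices of pages whose extracted text is blank."""
    return [i for i, text in enumerate(page_texts) if not text.strip()]

def pdfminer_page_texts(pdf_path, page_count):
    """Extract each page on its own with pdfminer.six (assumed to be installed)."""
    # Imported here so the helper above stays usable without pdfminer.six.
    from pdfminer.high_level import extract_text
    return [extract_text(pdf_path, page_numbers=[i]) for i in range(page_count)]
```

Running `empty_pages` on the output of both libraries shows quickly whether pdfminer recovers the pages that pyPdf misses.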

  • 2021-01-05 13:49

    This isn't really an answer, but the problem with pyPdf is this: it doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (bytes in the PDF) to Unicode character codes. When you have a PDF that contains non-ASCII characters, there's probably a CMap in use, and sometimes there is one even when there are no non-ASCII characters. When pyPdf encounters strings that are not in a standard Unicode encoding, it just sees a bunch of bytes; it can't convert those bytes to Unicode, so it just gives you empty strings. I actually had this same problem and I'm working on the source code at the moment. It's time-consuming, but I hope to send a patch to the maintainer some time around mid-2011.
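To make the CMap idea concrete, here is a deliberately simplified Python 3 illustration, not pyPdf code. It parses only the `beginbfchar` sections of a ToUnicode CMap (it ignores `bfrange` and codespace ranges) and assumes fixed two-byte character codes. The sample CMap fragment and the function names are made up for the example:

```python
import re

# A tiny, made-up ToUnicode CMap fragment: code 0x0001 -> "H", code 0x0002 -> "i".
SAMPLE_CMAP = """
2 beginbfchar
<0001> <0048>
<0002> <0069>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from the beginbfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units written as hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def decode_string(raw, mapping, code_size=2):
    """Decode fixed-width character codes; unmapped codes become U+FFFD."""
    chars = []
    for i in range(0, len(raw), code_size):
        code = int.from_bytes(raw[i:i + code_size], "big")
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)

mapping = parse_bfchar(SAMPLE_CMAP)
print(decode_string(b"\x00\x01\x00\x02", mapping))
```

Without the mapping, the bytes `00 01 00 02` carry no readable meaning, which is why an extractor that skips the CMap step ends up with empty or garbage strings.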
