pyPdf unable to extract text from some pages in my PDF

伪装坚强ぢ 2021-01-05 13:07

I'm trying to use pyPdf to extract and print pages from a multipage PDF. The problem is that text is not extracted from some pages. I've put an example file here:

http://w

6 Answers
  •  一生所求
    2021-01-05 13:49

    This isn't really an answer, but the underlying problem is that pyPdf doesn't yet support CMaps. PDF allows fonts to use CMaps to map character IDs (the bytes stored in the PDF) to Unicode code points. A PDF that contains non-ASCII characters probably uses a CMap, and sometimes one is in use even when every character is ASCII. When pyPdf encounters strings that are not in a standard Unicode encoding, it sees only a sequence of raw bytes; it can't convert those bytes to Unicode, so it returns empty strings. I ran into this same problem and am working on the source code at the moment. It's time-consuming, but I hope to send a patch to the maintainer some time around mid-2011.
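
To make the CMap idea above concrete, here is a minimal sketch of how a ToUnicode CMap maps character codes to Unicode. It is not pyPdf code and not the patch mentioned in the answer: `SAMPLE_CMAP`, `parse_bfchar`, and `decode_codes` are hypothetical names made up for illustration, the CMap fragment is hand-written sample data, and only `bfchar` entries with fixed-width codes are handled (`bfrange` sections and variable-width codes are ignored).

```python
import re

# Hand-written ToUnicode CMap fragment (hypothetical sample data).
# Each entry maps a character code to a UTF-16BE encoded string.
SAMPLE_CMAP = """
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {character code: text} built from the bfchar sections only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        pairs = re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section)
        for src, dst in pairs:
            # Destination values in a ToUnicode CMap are UTF-16BE.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_codes(raw, mapping, code_size=1):
    """Translate fixed-width character codes in `raw` into text.

    Codes missing from the mapping become U+FFFD (replacement character).
    """
    chars = []
    for i in range(0, len(raw), code_size):
        code = int.from_bytes(raw[i:i + code_size], "big")
        chars.append(mapping.get(code, "\ufffd"))
    return "".join(chars)


mapping = parse_bfchar(SAMPLE_CMAP)
print(decode_codes(b"\x01\x02", mapping))  # prints "Hi"
```

Without the mapping, the bytes `\x01\x02` carry no textual meaning on their own, which is why an extractor that ignores CMaps cannot recover the text.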
