Font cannot be extracted by PDFMiner

Submitted by 对着背影说爱祢 on 2019-12-01 19:05:58

Question


I am converting some PDF reports to plain text using PDFMiner, and a number of my input PDFs come out with only a couple of recognised lines followed by a run of (cid:%d) tokens, like this:

Inspection report

(cid:4)(cid:5)(cid:6)(cid:7)(cid:8)(cid:9) (cid:10)(cid:9)(cid:11)(cid:9)(cid:12)(cid:9)(cid:5)(cid:13)(cid:9) (cid:14)(cid:8)(cid:15)(cid:16)(cid:9)(cid:12) (cid:17)(cid:18)(cid:13)(cid:19)(cid:20) (cid:21)(cid:8)(cid:22)(cid:23)(cid:18)(cid:12)(cid:6)(cid:22)(cid:24) (cid:25)(cid:5)(cid:26)(cid:27)(cid:9)(cid:13)(cid:22)(cid:6)(cid:18)(cid:5) (cid:5)(cid:8)(cid:15)(cid:16)(cid:9)(cid:12)
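To sort a batch of converted reports into those that extracted cleanly and those that did not, the output text can be checked for these placeholders. The following is a minimal sketch that works only on the already-extracted text and does not call PDFMiner. The names `cid_ratio` and `looks_unmapped` and the 0.5 threshold are assumptions made for illustration.

```python
import re

# Matches the placeholder seen in the output above, e.g. "(cid:17)".
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def cid_ratio(text):
    """Return the fraction of visible glyphs that are (cid:N) placeholders."""
    cid_count = len(CID_TOKEN.findall(text))
    # Strip the placeholders, then count the remaining non-whitespace characters.
    remainder = CID_TOKEN.sub("", text)
    plain_count = sum(1 for ch in remainder if not ch.isspace())
    total = cid_count + plain_count
    return cid_count / float(total) if total else 0.0


def looks_unmapped(text, threshold=0.5):
    """Hypothetical heuristic: flag text that consists mostly of placeholders."""
    return cid_ratio(text) >= threshold
```

Run over each converted file, `looks_unmapped` would flag the affected reports so they can be set aside for a different extraction route.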

Having looked into it, I think the problem is that the bulk of the document is set in a font that resists extraction. Debugging has been rather strange because the font seemed to change overnight (don't ask how, it just did).

I'm not sure which of these are significant, but today the font has the following properties:

  name = 'font0000000018f29a3e'
  cidcoding = 'Adobe-Identity'
  unicode_map = 'UnicodeMap: /Adobe-Identity-UCS'
  unicode_map.cid2unichr = {}
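The empty `cid2unichr` dictionary looks like the relevant property: with no entries there is nothing to translate a character ID into a Unicode character. The sketch below is a simplified model of that lookup, not PDFMiner's actual code. The function `render_glyphs` and the sample mapping are hypothetical, and the fallback format is assumed from the (cid:%d) output shown earlier.

```python
def render_glyphs(cids, cid2unichr):
    """Map character IDs to text, falling back to a (cid:N) placeholder."""
    pieces = []
    for cid in cids:
        if cid in cid2unichr:
            # A mapping exists, so the glyph becomes a real character.
            pieces.append(cid2unichr[cid])
        else:
            # No mapping: emit a placeholder in the style of the output above.
            pieces.append("(cid:%d)" % cid)
    return "".join(pieces)


# Hypothetical mapping, for illustration only; not taken from the report.
mapped = render_glyphs([4, 5, 6], {4: "U", 5: "n", 6: "i"})

# An empty mapping, as reported for this font, leaves every glyph unresolved.
unmapped = render_glyphs([4, 5, 6], {})
```

Under this model, an empty map would produce placeholders for every glyph in the font, which matches the symptom.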

I'm using Python 2.7 on a Mac and have tried a few things:

  1. Trying PyPDF2 instead
  2. Copying and pasting into TextEdit (the characters come out blank)
  3. Uninstalling and reinstalling PDFMiner with the cmaps rebuilt
  4. Turning the machine off and then on again

For reference, the reports all have a similar form; one of them can be found here:

http://www.ofsted.gov.uk/provider/files/959173/urn/118074.pdf

The problem applies to all reports published prior to September 2010.

Source: https://stackoverflow.com/questions/22908556/font-cannot-be-extracted-by-pdfminer
