Question
I am converting some PDF reports to plain text using PDFMiner, and a bunch of my input PDFs come out with just a couple of recognised lines followed by a run of (cid:%d) placeholders, a little like this:
Inspection report
(cid:4)(cid:5)(cid:6)(cid:7)(cid:8)(cid:9) (cid:10)(cid:9)(cid:11)(cid:9)(cid:12)(cid:9)(cid:5)(cid:13)(cid:9) (cid:14)(cid:8)(cid:15)(cid:16)(cid:9)(cid:12) (cid:17)(cid:18)(cid:13)(cid:19)(cid:20) (cid:21)(cid:8)(cid:22)(cid:23)(cid:18)(cid:12)(cid:6)(cid:22)(cid:24) (cid:25)(cid:5)(cid:26)(cid:27)(cid:9)(cid:13)(cid:22)(cid:6)(cid:18)(cid:5) (cid:5)(cid:8)(cid:15)(cid:16)(cid:9)(cid:12)
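To tell affected reports apart from healthy ones, the share of (cid:N) placeholders in the already-extracted text can be measured. This is a minimal sketch and not part of PDFMiner: the names CID_TOKEN, cid_ratio and looks_unmapped, and the 0.5 threshold, are assumptions chosen for illustration.

```python
import re

# Matches one placeholder such as "(cid:17)".
CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_ratio(text):
    """Fraction of glyphs in `text` that are (cid:N) placeholders.

    Each placeholder counts as one glyph; every other non-whitespace
    character counts as one readable glyph.
    """
    unresolved = len(CID_TOKEN.findall(text))
    remainder = CID_TOKEN.sub("", text)
    readable = sum(1 for ch in remainder if not ch.isspace())
    total = unresolved + readable
    if total == 0:
        return 0.0
    # float() keeps the division exact on Python 2.7 as well.
    return float(unresolved) / total


def looks_unmapped(text, threshold=0.5):
    """Hypothetical helper: flag text dominated by placeholders."""
    return cid_ratio(text) >= threshold


sample = "Inspection report\n(cid:4)(cid:5)(cid:6) (cid:10)(cid:9)"
print(cid_ratio(sample))
print(looks_unmapped(sample))
```

Running this over the text of every converted report would give a quick list of the files that need another extraction route.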
Looking into it, I think the problem is that the bulk of the document is in a font that resists extraction. Debugging has been rather strange because the font seemed to change overnight (don't ask how, it just did).
I'm not sure what might be significant, but today the font has these properties:
- name = 'font0000000018f29a3e'
- cidcoding = 'Adobe-Identity'
- unicode_map = 'UnicodeMap: /Adobe-Identity-UCS'
- unicode_map.cid2unichr = {}
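One reading of these properties, offered as an assumption rather than a confirmed diagnosis: if cid2unichr is empty, no character ID can be turned into a Unicode character, so each glyph falls back to a (cid:N) placeholder in the output. The toy model below only illustrates that fallback. decode_cids and the example mapping are hypothetical, and this is not PDFMiner's actual code.

```python
def decode_cids(cids, cid2unichr):
    """Turn character IDs into text, using a placeholder when unmapped."""
    pieces = []
    for cid in cids:
        if cid in cid2unichr:
            pieces.append(cid2unichr[cid])
        else:
            # Same placeholder shape as in the extracted text above.
            pieces.append("(cid:%d)" % cid)
    return "".join(pieces)


# With an empty map, as reported for this font, nothing can be decoded.
print(decode_cids([4, 5, 6], {}))

# With a made-up partial map, only the unmapped ID keeps its placeholder.
hypothetical_map = {4: "A", 5: "b"}
print(decode_cids([4, 5, 6], hypothetical_map))
```

If this reading is right, the question becomes where a usable CID-to-Unicode mapping for the embedded font could come from, since the font itself does not seem to provide one.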
I'm using Python 2.7 on a Mac and have tried a few things:
- PyPDF2
- Copying and pasting into TextEdit (the characters come out blank)
- Uninstalling and reinstalling PDFMiner with the cmaps rebuilt
- Turning the machine off and then on again
For reference, the reports all have a similar form; one of them can be found here:
http://www.ofsted.gov.uk/provider/files/959173/urn/118074.pdf
The problem applies to all reports published prior to September 2010.
Source: https://stackoverflow.com/questions/22908556/font-cannot-be-extracted-by-pdfminer