Decode CID font codes to equivalent ASCII characters

Submitted by 拈花ヽ惹草 on 2019-12-07 12:03:14

Question


I'm trying to mine some text from a bunch of PDFs and a few of them have embedded CID fonts in the output:

(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:177)(cid:3)(cid:71)(cid:72)(cid:191)(cid:81)(cid:72)(cid:71)(cid:3)(cid:69)(cid:92)
(cid:3)(cid:56)(cid:49)(cid:3)(cid:43)(cid:68)(cid:69)(cid:76)(cid:87)(cid:68)(cid:87)
(cid:3)(cid:68)(cid:86)(cid:3)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:90)(cid:76)(cid:87)(cid:75)(cid:3)(cid:80)(cid:82)(cid:85)(cid:72)(cid:3)(cid:87)
(cid:75)(cid:68)(cid:81)(cid:3)(cid:20)(cid:19)(cid:3)

When I look at that exact snippet of text in the PDF, the letters are certainly convertible to ASCII.

This suggests that a brute-force decoding would work (i.e. read a snippet of text that corresponds to a run of CID codes and build a mapping from it), but will that be reliable across lots of different PDFs? Is there a reliable mapping from these CID codes to ASCII characters, or does it depend heavily on the font in the PDF? How can I determine which ASCII character a CID code like (cid:72) corresponds to?
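The brute-force idea can be sketched roughly as below. This is only an illustration: `learn_mapping` is a hypothetical helper, and `ASSUMED_TEXT` is an assumed reading of the first line of the output above, not something taken from the PDF itself. The sketch also assumes exactly one CID per visible character, which breaks down for ligatures.

```python
import re

# Matches the "(cid:NN)" tokens that PDFMiner prints for unmapped glyphs.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def learn_mapping(cid_text, visible_text):
    """Pair each CID with the character seen at the same position."""
    cids = [int(n) for n in CID_TOKEN.findall(cid_text)]
    if len(cids) != len(visible_text):
        raise ValueError("sample is not aligned one CID per character")
    mapping = {}
    for cid, ch in zip(cids, visible_text):
        # Refuse to continue if one CID appears to stand for two characters.
        if mapping.setdefault(cid, ch) != ch:
            raise ValueError(f"CID {cid} maps to both {mapping[cid]!r} and {ch!r}")
    return mapping


# First line of the extracted output shown above.
SAMPLE_CIDS = (
    "(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)"
    "(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)"
)
# ASSUMPTION: what that line is taken to look like when rendered.
ASSUMED_TEXT = "metacities "

mapping = learn_mapping(SAMPLE_CIDS, ASSUMED_TEXT)
print(sorted(mapping.items()))
```

A table learned this way describes only the font it was learned from, so it would have to be rebuilt for every font in every PDF.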

For what it's worth, I'm extracting the text using PDFMiner, which appears to be the only tool that actually reports the CID codes. If there is a better tool out there for converting PDFs to HTML or any other parsable text format, I'm open to suggestions!

As an added bonus, this question appears to be related to a few other unanswered questions, so there is a rich bounty of reputation on the line here:

  • Font cannot be extracted by PDFMiner
  • What is this (cid:51) in the output of pdf2txt?

Answer 1:


While you can probably do this by guesswork for the simple example here, to do it correctly you'll need two additional pieces of information:

1) The Registry-Ordering-Supplement (ROS) information for the font in question. This will usually be something like 'Adobe-Japan1-5' or some such and is an informational property stored in the font. The ROS determines how the CIDs are to be interpreted. A given CID in one font is not necessarily the same as a CID in another font, unless the ROSes are the same. That is to say: CID12345 in Adobe-Japan1-5 is not the same shape as CID12345 in Adobe-GB1-3!

2) Armed with the ROS info, select a compatible CMap and decode through that. Targeting ASCII is a bit short-sighted; I would go with Unicode, of which ASCII is a subset. You can find CMap files for the Adobe-defined ROSes at http://sourceforge.net/projects/cmap.adobe/files/
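As a rough sketch of the decoding step, the snippet below reads the `bfchar` and `bfrange` sections of a CID-to-Unicode CMap and runs PDFMiner-style `(cid:N)` tokens through the result. The names `parse_cid_to_unicode`, `decode_cids` and `SAMPLE_CMAP` are made up for illustration, and `SAMPLE_CMAP` is a hypothetical fragment written for this example, not the CMap of the font in the question. It handles only the plain `<start> <end> <target>` form of `bfrange` with a single BMP target, so treat it as a starting point rather than a full CMap parser.

```python
import re

# Matches the "(cid:NN)" tokens that PDFMiner prints for unmapped glyphs.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")
HEX = r"<([0-9A-Fa-f]+)>"


def parse_cid_to_unicode(cmap_text):
    """Build {cid: str} from the bfchar/bfrange sections of a CMap."""
    mapping = {}
    # bfchar entry: <source code> <UTF-16BE target>, one pair per entry.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange entry: <first code> <last code> <target of first code>.
    # Simplification: the target is assumed to be one BMP code point.
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            first = int(dst, 16)
            for offset, cid in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[cid] = chr(first + offset)
    return mapping


def decode_cids(text, mapping, unknown="\ufffd"):
    """Replace every (cid:N) token; unmapped CIDs become U+FFFD."""
    return CID_TOKEN.sub(lambda m: mapping.get(int(m.group(1)), unknown), text)


# HYPOTHETICAL fragment, invented for this example.
SAMPLE_CMAP = """
3 beginbfchar
<0003> <0020>
<00B1> <2013>
<00BF> <00660069>
endbfchar
3 beginbfrange
<0013> <001C> <0030>
<0024> <003D> <0041>
<0044> <005D> <0061>
endbfrange
"""

table = parse_cid_to_unicode(SAMPLE_CMAP)
print(decode_cids("(cid:71)(cid:72)(cid:191)(cid:81)(cid:72)(cid:71)", table))
```

Decoding to `str` rather than ASCII bytes lets entries such as the dash and the two-character ligature target survive, which fits the advice to target Unicode.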

More information on CID and CMaps direct from the inventors is available at http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf



Source: https://stackoverflow.com/questions/24089245/decode-cid-font-codes-to-equivalent-ascii-characters
