how to add unicode in truetype0font on pdfbox 2.0.0?

亡梦爱人 提交于 2019-11-26 20:28:02

Here's some code to add a ToUnicode CMap stream in a font. Obviously I can't do it with your file, so I used one of my test files, which can be found here. I had to work on each entry separately and didn't do all. However the result is good enough to extract the first word in the green print ("Bedingungen").

The scenario is somewhat tailored to you:

  • Identity-H entry
  • no ToUnicode entry
  • specific font name

    try (PDDocument doc = PDDocument.load(f))
    {
        for (int p = 0; p < doc.getNumberOfPages(); ++p)
        {
            PDPage page = doc.getPage(p);
            PDResources res = page.getResources();
            for (COSName fontName : res.getFontNames())
            {
                PDFont font = res.getFont(fontName);
                COSBase encoding = font.getCOSObject().getDictionaryObject(COSName.ENCODING);
                if (!COSName.IDENTITY_H.equals(encoding))
                {
                    continue;
                }
                // get real name
                String fname = font.getName();
                int plus = fname.indexOf('+');
                if (plus != -1)
                {
                    fname = fname.substring(plus + 1);
                }
                if (font.getCOSObject().containsKey(COSName.TO_UNICODE))
                {
                    continue;
                }
                System.out.println("File '" + f.getName() + "', page " + (p + 1) + ", " + fontName.getName() + ", " + font.getName());
                if (!fname.startsWith("Calibri-Bold"))
                {
                    continue;
                }
                COSStream toUnicodeStream = new COSStream();
                try (PrintWriter pw = new PrintWriter(toUnicodeStream.createOutputStream(COSName.FLATE_DECODE)))
                {
                    // "9.10 Extraction of Text Content" in the PDF 32000 specification
                    pw.println ("/CIDInit /ProcSet findresource begin\n" +
                            "12 dict begin\n" +
                            "begincmap\n" +
                            "/CIDSystemInfo\n" +
                            "<< /Registry (Adobe)\n" +
                            "/Ordering (UCS) /Supplement 0 >> def\n" +
                            "/CMapName /Adobe-Identity-UCS def\n" +
                            "/CMapType 2 def\n" +
                            "1 begincodespacerange\n" +
                            "<0000> <FFFF>\n" +
                            "endcodespacerange\n" +
                            "10 beginbfchar\n" + // number is count of entries
                            "<0001><0020>\n" + // space
                            "<0002><0041>\n" + // A
                            "<0003><0042>\n" + // B
                            "<0004><0044>\n" + // D
                            "<0013><0065>\n" + // e
                            "<0012><0064>\n" + // d
                            "<0017><0069>\n" + // i
                            "<001B><006E>\n" + // n
                            "<0015><0067>\n" + // g
                            "<0020><0075>\n" + // u
                            "endbfchar\n" +
                            "endcmap CMapName currentdict /CMap defineresource pop end end");
                }
                font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicodeStream);
            }
        }
        doc.save("huhu.pdf");
    }
    

Btw the unreleased 2.1 version of PDFDebugger has some improved features to show fonts, you can get it here:

You can use it to verify that your ToUnicode CMap makes sense. Here's what I get with my changes:

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!