Unicode in PDF

醉酒成梦 2020-12-01 08:57

My program generates relatively simple PDF documents on request, but I'm having trouble with Unicode characters, like kanji or odd math symbols. To write a normal string in

7 answers
  • 2020-12-01 09:56

    I have worked on this subject for several days now, and what I have learned is that Unicode is (as good as) impossible in PDF. Using 2-byte characters the way plinth described only works with CID fonts.

    Seemingly, CID fonts are a PDF-internal construct and not really fonts in the usual sense: they seem to be more like graphics subroutines that can be invoked by addressing them (with 16-bit addresses).

    So to use Unicode in PDF directly:

    1. You would have to convert normal fonts to CID fonts, which is probably extremely hard: you'd have to generate the graphics routines from the original font(?), extract character metrics, etc.
    2. You cannot use CID fonts like normal fonts: you cannot load or scale them the way you load and scale normal fonts.
    3. Also, 2-byte characters don't even cover the full Unicode space.

    IMHO, these points make it absolutely unfeasible to use Unicode directly.
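
    To illustrate point 3: a single 16-bit value can only address U+0000 to U+FFFF, so anything beyond that needs two 16-bit units when encoded as UTF-16. A minimal sketch of that arithmetic (the function name ToUtf16 is made up for this illustration and does no input validation):

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Code points up to U+FFFF fit in one unit; anything above needs a surrogate pair.
// Hypothetical helper: surrogate-range or out-of-range input is not checked.
std::vector<std::uint16_t> ToUtf16(std::uint32_t codepoint)
{
    if (codepoint <= 0xFFFF)
        return { static_cast<std::uint16_t>(codepoint) };

    std::uint32_t v = codepoint - 0x10000;                 // 20 bits remain
    return {
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),    // high surrogate
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF))   // low surrogate
    };
}
```

    The Euro sign (U+20AC) comes back as one unit, while a character such as the musical G clef (U+1D11E) comes back as two.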



    What I am doing instead now is using the characters indirectly in the following way: for every font, I generate a codepage (and a lookup table for fast lookups). In C++ this would be something like:

    std::map<std::string, std::vector<wchar_t> > Codepage;
    std::map<std::string, std::map<wchar_t, int> > LookupTable;
    

    Then, whenever I want to put a Unicode string on a page, I iterate over its characters, look them up in the lookup table and, if they are new, add them to the codepage like this:

    for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
    {                
        if(LookupTable[fontname].find(*i) == LookupTable[fontname].end())
        {
            LookupTable[fontname][*i] = Codepage[fontname].size();
            Codepage[fontname].push_back(*i);
        }
    }
    

    Then I generate a new string in which the characters from the original string are replaced by their positions in the codepage, like this:

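    static const std::string hex = "0123456789ABCDEF";
    std::string result = "<";                       // PDF hex strings are delimited by < >
    for(std::wstring::const_iterator i = str.begin(); i != str.end(); ++i)
    {                
        int id = LookupTable[fontname][*i] + 1;     // codes are 1-based; 0 stays unused
        result += hex[(id & 0x00F0) >> 4];          // high nibble (only the low byte of id is written)
        result += hex[(id & 0x000F)];               // low nibble
    }
    result += ">";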
    

    For example, "H€llo World!" might become <010203030405060407030809> (one two-digit code per character), and now you can just put that string into the PDF and have it printed, using the Tj operator as usual.
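
    The two loops above can also be folded into one function. This is only a sketch under my own assumptions: the names FontCodepage and EncodeForFont are invented here, there is one FontCodepage per font instead of the maps keyed by font name, and the codepage is assumed never to grow past 255 entries.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Per-font codepage: chars[i] is the character assigned to code i + 1.
// (Hypothetical type, standing in for Codepage[fontname] / LookupTable[fontname].)
struct FontCodepage
{
    std::vector<wchar_t> chars;
    std::map<wchar_t, std::size_t> index;
};

// Register every new character of str and return the PDF hex string that
// addresses the characters by their 1-based position in the codepage.
// Assumes at most 255 entries, since each code is written as one byte.
std::string EncodeForFont(FontCodepage& cp, const std::wstring& str)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string result = "<";
    for (wchar_t c : str)
    {
        auto it = cp.index.find(c);
        if (it == cp.index.end())
        {
            // New character: append it and remember its 0-based position.
            it = cp.index.emplace(c, cp.chars.size()).first;
            cp.chars.push_back(c);
        }
        std::size_t id = it->second + 1;
        result += hex[(id >> 4) & 0xF];
        result += hex[id & 0xF];
    }
    result += ">";
    return result;
}
```

    Calling EncodeForFont(cp, L"H\u20ACllo World!") on an empty codepage leaves nine entries in cp.chars, one per distinct character.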

    But you now have a problem: the PDF doesn't know that you mean "H" by 01. To solve this, you also have to include the codepage in the PDF file. This is done by adding an /Encoding to the font object and setting its /Differences array.

    For the "H€llo World!" example, this font object would work:

    5 0 obj 
    <<
        /F1
        <<
            /Type /Font
            /Subtype /Type1
            /BaseFont /Times-Roman
            /Encoding
            <<
              /Type /Encoding
              /Differences [ 1 /H /Euro /l /o /space /W /r /d /exclam ]
            >>
        >> 
    >>
    endobj 
    

    I generate it with this code:

    ObjectOffsets.push_back(stream->tellp()); // xrefs entry
    (*stream) << ObjectCounter++ << " 0 obj \n<<\n";
    int fontid = 1;
    for(std::list<std::string>::iterator i = Fonts.begin(); i != Fonts.end(); i++)
    {
        (*stream) << "  /F" << fontid++ << " << /Type /Font /Subtype /Type1 /BaseFont /" << *i;
    
        (*stream) << " /Encoding << /Type /Encoding /Differences [ 1 \n";
        for(std::vector<wchar_t>::iterator j = Codepage[*i].begin(); j != Codepage[*i].end(); j++)
            (*stream) << "    /" << GlyphName(*j) << "\n";
        (*stream) << "  ] >>";
    
        (*stream) << " >> \n";
    }
    (*stream) << ">>\n";
    (*stream) << "endobj \n\n";
    

    Notice that I use a global font register: I use the same font names /F1, /F2, ... throughout the whole PDF document. The same font-register object is referenced in the /Resources entry of all pages. If you do this differently (e.g. one font register per page), you might have to adapt the code to your situation.

    So how do you find the names of the glyphs (/Euro for "€", /exclam for "!" etc.)? In the above code, this is done by simply calling GlyphName(*j). I generated this function with a Bash script from the list found at
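
    If you want to check the /Differences array on its own, it can be built separately from the stream writing. A small sketch, assuming the glyph names have already been looked up (the function name BuildEncodingDict is invented for this example):

```cpp
#include <string>
#include <vector>

// Build the /Encoding dictionary for one codepage.
// glyphNames[i] is the glyph name for code i + 1, so the array starts at 1.
// (Hypothetical helper; it does not check the names or the codepage size.)
std::string BuildEncodingDict(const std::vector<std::string>& glyphNames)
{
    std::string out = "<< /Type /Encoding /Differences [ 1";
    for (const std::string& name : glyphNames)
        out += " /" + name;
    out += " ] >>";
    return out;
}
```

    Passing the nine glyph names of the example (H, Euro, l, o, space, W, r, d, exclam) gives the same dictionary as in the font object shown earlier, just on a single line.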

    http://www.jdawiseman.com/papers/trivia/character-entities.html

    and it looks like this:

    std::string GlyphName(wchar_t UnicodeCodepoint)
    {
        switch(UnicodeCodepoint)
        {
            case 0x00A0: return "nonbreakingspace";
            case 0x00A1: return "exclamdown";
            case 0x00A2: return "cent";
            ...
            default: return ".notdef"; // unknown code point: never fall off the end without returning
        }
    }
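
    A table-driven variant is sketched below. It is not the generated function: GlyphNameOrUni is a made-up name, the table is only a tiny excerpt, and the fallback uses the "uniXXXX" / "uXXXXX" naming convention from the Adobe Glyph List specification. Whether a given font actually contains a glyph under such a name is a separate question (Times-Roman will not have kanji, for instance).

```cpp
#include <cstdio>
#include <map>
#include <string>

// Look up a glyph name for a code point, falling back to AGL-style
// "uniXXXX" (BMP) or "uXXXXX" (beyond the BMP) names.
// Hypothetical helper with a deliberately incomplete table.
std::string GlyphNameOrUni(wchar_t codepoint)
{
    // Basic Latin letters are named after themselves.
    if ((codepoint >= L'A' && codepoint <= L'Z') || (codepoint >= L'a' && codepoint <= L'z'))
        return std::string(1, static_cast<char>(codepoint));

    // Tiny excerpt; a real table would be generated from a glyph list.
    static const std::map<wchar_t, std::string> names = {
        { 0x0020, "space" },
        { 0x0021, "exclam" },
        { 0x00A1, "exclamdown" },
        { 0x00A2, "cent" },
        { 0x20AC, "Euro" },
    };
    auto it = names.find(codepoint);
    if (it != names.end())
        return it->second;

    // Fallback: derive a name from the code point itself.
    unsigned long cp = static_cast<unsigned long>(codepoint);
    char buf[16];
    if (cp <= 0xFFFFul)
        std::snprintf(buf, sizeof buf, "uni%04lX", cp);
    else
        std::snprintf(buf, sizeof buf, "u%lX", cp);
    return buf;
}
```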
    

    A major problem I have left open is that this only works as long as you use at most 254 different characters from the same font. To use more than 254 different characters, you would have to create multiple codepages for the same font.

    Inside the PDF, different codepages are represented by different fonts, so to switch between codepages you would have to switch fonts. That could theoretically blow up your PDF quite a bit, but I, for one, can live with that.
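
    I have not implemented this, but the bookkeeping for multiple codepages could look roughly like the sketch below. Everything in it is an assumption for illustration: the class name PagedCodepages, the default limit of 254 taken from the text above, and the idea that each page index maps to its own font name (/F1, /F2, ...) when the PDF is written.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// One logical font split into several codepages of limited size.
// Each codepage would be emitted as its own PDF font object.
class PagedCodepages
{
public:
    explicit PagedCodepages(std::size_t maxPerPage = 254) : maxPerPage_(maxPerPage) {}

    // Returns (page index, 1-based code within that page), adding c if it is new.
    std::pair<std::size_t, std::size_t> Lookup(wchar_t c)
    {
        auto it = where_.find(c);
        if (it != where_.end())
            return it->second;

        if (pages_.empty() || pages_.back().size() >= maxPerPage_)
            pages_.emplace_back();                  // current page is full: start a new one
        pages_.back().push_back(c);

        std::pair<std::size_t, std::size_t> loc(pages_.size() - 1, pages_.back().size());
        where_[c] = loc;
        return loc;
    }

    std::size_t PageCount() const { return pages_.size(); }

private:
    std::size_t maxPerPage_;
    std::vector<std::vector<wchar_t>> pages_;
    std::map<wchar_t, std::pair<std::size_t, std::size_t>> where_;
};
```

    When writing a string, you would then have to emit a font switch (Tf) each time the page index changes between consecutive characters, which is where the size overhead comes from.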
