Unicode in PDF

后端 未结 7 569
醉酒成梦
醉酒成梦 2020-12-01 08:57

My program generates relatively simple PDF documents on request, but I\'m having trouble with unicode characters, like kanji or odd math symbols. To write a normal string in

相关标签:
7条回答
  • 2020-12-01 09:48

    See Appendix D (page 995) of the PDF specification. There is a limited number of fonts and character sets pre-defined in a PDF consumer application. To display other characters you need to embed a font that contains them. It is also preferable to embed only a subset of the font, including only required characters, in order to reduce file size. I am also working on displaying Unicode characters in PDF and it is a major hassle.

    Check out PDFBox or iText.

    http://www.adobe.com/devnet/pdf/pdf_reference.html

    0 讨论(0)
  • 2020-12-01 09:51

    Algoman's answer is wrong in many things. You can make a PDF document with Unicode in it and it's not rocket science, though it needs some work. Yes he is right, to use more than 255 characters in one font you have to create a composite font (CIDFont) pdf object. Then you just mention the actual TrueType font you want to use as a DescendatFont entry of CIDFont. The trick is that after that you have to use glyph indices of a font instead of character codes. To get this indices map you have to parse cmap section of a font - get contents of the font with GetFontData function and take hands on TTF specification. And that's it! I've just did it and now I have a Unicode PDF!

    Sample Code for parsing cmap section is here: https://web.archive.org/web/20150329005245/http://support.microsoft.com/en-us/kb/241020

    And yes, don't forget /ToUnicode entry as @user2373071 pointed out or user will not be able to search your PDF or copy text from it.

    0 讨论(0)
  • 2020-12-01 09:53

    The simple answer is that there's no simple answer. If you take a look at the PDF specification, you'll see an entire chapter — and a long one at that — devoted to the mechanisms of text display. I implemented all of the PDF support for my company, and handling text was by far the most complex part of exercise. The solution you discovered — use a 3rd party library to do the work for you — is really the best choice, unless you have very specific, special-purpose requirements for your PDF files.

    0 讨论(0)
  • 2020-12-01 09:54

    In the PDF reference in chapter 3, this is what they say about Unicode:

    Text strings are encoded in either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode is described in the Unicode Standard by the Unicode Consortium (see the Bibliography). For text strings encoded in Unicode, the first two bytes must be 254 followed by 255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified in the Unicode standard. (This mechanism precludes beginning a string using PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to be a meaningful beginning of a word or phrase).

    0 讨论(0)
  • 2020-12-01 09:54

    I'm not a PDF expert, and (as Ferruccio said) the PDF specs at Adobe should tell you everything, but a thought popped up in my mind:

    Are you sure you are using a font that supports all the characters you need?

    In our application, we create PDF from HTML pages (with a third party library), and we had this problem with cyrillic characters...

    0 讨论(0)
  • 2020-12-01 09:56

    As dredkin pointed out, you have to use the glyph indices instead of the Unicode character value in the page content stream. This is sufficient to display Unicode text in PDF, but the Unicode text would not be searchable. To make the text searchable or have copy/paste work on it, you will also need to include a /ToUnicode stream. This stream should translate each glyph in the document to the actual Unicode character.

    0 讨论(0)
提交回复
热议问题