Unicode characters necessary for Japanese, Korean, and Chinese

梦毁少年i 2021-01-25 02:14

I'm trying to answer these basic questions without getting a degree in linguistics and early human history, which seems to be where every Google search has led.

3 Answers
  • 2021-01-25 02:44

    Start with the East Asian Scripts in the Code Charts @ unicode.org.

    For example, Hiragana is U+3040 to U+309F, and Katakana is U+30A0 to U+30FF.
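    As a minimal sketch of how those two block ranges could be used, the snippet below maps a single character to its kana block. The names `KANA_BLOCKS` and `kana_block` are made up for illustration, and only the two ranges quoted above are included.

```python
# Block ranges taken from the code charts quoted above.
# KANA_BLOCKS and kana_block are illustrative names, not a standard API.
KANA_BLOCKS = {
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
}


def kana_block(ch):
    """Return the block name for a single character, or None if it is in neither block."""
    cp = ord(ch)
    for name, (lo, hi) in KANA_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


if __name__ == "__main__":
    for ch in "あカA":
        print(f"U+{ord(ch):04X}", kana_block(ch))
```

    Other blocks from the charts can be added to the table in the same way.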

  • 2021-01-25 02:52

    You can approximate such lists by looking at the appropriate Unicode properties (in particular, the "Script" of each character), but this does not fully reflect actual character use.
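    Python's standard `unicodedata` module does not expose the Script property, so the rough sketch below uses the start of each character's name as a stand-in. That is an assumption made for illustration, and it will disagree with the real property in places (for instance, a few marks named `KATAKANA-...` actually have script Common). The helper name `rough_script` is hypothetical.

```python
import unicodedata
from collections import Counter


def rough_script(ch):
    """Guess a script label from the character name (an approximation, not the Script property)."""
    name = unicodedata.name(ch, "")
    if name.startswith(("CJK UNIFIED IDEOGRAPH", "CJK COMPATIBILITY IDEOGRAPH")):
        return "Han"
    for prefix in ("HIRAGANA", "KATAKANA", "HANGUL"):
        if name.startswith(prefix):
            return prefix.title()
    return "Other"


def script_counts(text):
    """Count how many characters of the text fall under each guessed label."""
    return Counter(rough_script(ch) for ch in text)


if __name__ == "__main__":
    print(script_counts("漢字とカタカナ, 한글"))
```

    For the real property you would need a library or data file that carries it; this only shows the idea of grouping characters by script.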

    A better indicator would be the character sets that have already been defined for fonts for those languages (e.g., Adobe-Japan1-6, Adobe-GB1-5, and Adobe-Korea1-2) described in this tech note (the exact character sets are defined separately). The CMap files should allow you to translate them back into Unicode code points.

  • 2021-01-25 02:53

    It depends on how much coverage you want to give each of those languages. The most commonly used characters in all these languages amount to only a few thousand, but once in a while you will encounter a character outside that coverage. As you increase the number of characters your system supports, you become less likely to run into missing characters, up to the point where you cover all the CJK characters.

    A common approach among modern font developers, to cut the time and effort of making a font while still supporting enough characters to display most text, is to use the ranges given by pre-Unicode-era character sets such as Big5(-HKSCS), GB 2312 or GB 18030, and those mentioned in the comments on the other answers. Even so, it is fairly common to encounter characters that are not supported.
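    One way to see what a legacy character set leaves out is to try encoding text with the matching Python codec and collect the characters that fail. This is only a sketch: the helper name `uncovered` is made up, and the codec names (`gb2312`, `big5hkscs`, `euc_kr`, and so on) are the ones shipped with CPython, which may not match a given font's repertoire exactly.

```python
def uncovered(text, codec):
    """Return the distinct characters in text that the given legacy codec cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            # The character has no mapping in this character set.
            if ch not in missing:
                missing.append(ch)
    return missing


if __name__ == "__main__":
    sample = "中文 한글"
    for codec in ("gb2312", "big5hkscs", "euc_kr"):
        print(codec, uncovered(sample, codec))
```

    Running this over a representative corpus gives a rough feel for how often a set based on one of these standards would hit a missing character.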

    Unicode also defines something called IICore, a set of about ten thousand characters considered minimally essential for supporting these languages, and the Unicode database records whether each one is essential for Chinese, Japanese, Korean and so on. Nowadays, however, barely anyone uses it.

    Google and Adobe now produce the Noto CJK fonts, also known as Source Han, which are meant to cover as many CJK characters as possible. However, because of a file-format limitation they can fit only about 65,535 glyphs into a font, so characters have had to be added and dropped in the process of making them.

    Finally, for Korean specifically, supporting only Hangul/Jamo is probably good enough in many cases, because Hanja (the ideographic characters) have largely fallen out of use outside specialized areas. Note that personal names and some words in titles may still use Hanja, so it depends on whether those matter to you.
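    To check whether Hangul-only support would be enough for your own data, a small scan for Hanja can help. The sketch below assumes the main CJK Unified Ideographs block (U+4E00 to U+9FFF) is a good enough proxy for Hanja and treats Hangul Syllables (U+AC00 to U+D7A3), Hangul Jamo (U+1100 to U+11FF) and Hangul Compatibility Jamo (U+3130 to U+318F) as Hangul. The function names are illustrative.

```python
# Ranges assumed for this sketch; extension blocks are deliberately left out.
HANGUL_RANGES = [(0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F)]
HANJA_RANGE = (0x4E00, 0x9FFF)


def is_hangul(ch):
    """True if the character falls in one of the Hangul ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HANGUL_RANGES)


def find_hanja(text):
    """Return the ideographs in text, in order of appearance."""
    lo, hi = HANJA_RANGE
    return [ch for ch in text if lo <= ord(ch) <= hi]


if __name__ == "__main__":
    name = "김철수 (金哲洙)"
    print(find_hanja(name))
```

    If a scan like this turns up ideographs mostly in names and titles, that matches the caveat above.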
