Simplified Chinese Unicode table

遥遥无期 2020-12-05 12:45

Where can I find a Unicode table showing only the simplified Chinese characters? I have searched everywhere but cannot find anything.

UPDATE :
I

6 Answers
  • 2020-12-05 13:06

    I don't believe there is a table containing only simplified code points. As far as I know they are all lumped together with the traditional forms in the CJK range, U+4E00 through U+9FFF.
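
    As a rough illustration of that point, a range check can tell whether a character sits in the U+4E00 to U+9FFF range, but not whether it is simplified. This is a minimal sketch; the constant `CJK_UNIFIED` and the method `cjk_unified?` are names made up for the example.

```ruby
# The CJK range mentioned above; simplified and traditional forms share it.
CJK_UNIFIED = (0x4E00..0x9FFF)

# True when the first character of +char+ lies inside the range.
def cjk_unified?(char)
  CJK_UNIFIED.cover?(char.ord)
end

puts cjk_unified?("机") # simplified form
puts cjk_unified?("機") # traditional form of the same character
puts cjk_unified?("a")  # Latin letter, outside the range
```

    Both 机 and 機 pass the check, so the range alone cannot separate the two scripts.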

  • 2020-12-05 13:07

    The Unihan database contains this information in the file Unihan_Variants.txt. For example, one traditional/simplified pair is recorded as:

    U+673A  kTraditionalVariant     U+6A5F
    U+6A5F  kSimplifiedVariant      U+673A
    

    In the above case, U+6A5F is 機, the traditional form of 机 (U+673A).
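
    A minimal sketch of how such lines might be turned into a set of simplified forms, assuming tab-separated fields as in the excerpt above. The names `VARIANT_SAMPLE`, `char_from` and `simplified_forms` are hypothetical, and treating every character that has a kTraditionalVariant entry as simplified is a simplifying assumption, since some characters may be listed under both fields.

```ruby
require "set"

# Two sample lines in the layout quoted above; a real run would read
# Unihan_Variants.txt instead (for example with File.read).
VARIANT_SAMPLE = "U+673A\tkTraditionalVariant\tU+6A5F\n" \
                 "U+6A5F\tkSimplifiedVariant\tU+673A\n"

# Convert a string such as "U+673A" into the character it names.
def char_from(code_point)
  [code_point.delete_prefix("U+").hex].pack("U")
end

# Collect characters that the data marks as simplified forms.
def simplified_forms(text)
  forms = Set.new
  text.each_line do |line|
    next if line.start_with?("#") || line.strip.empty?

    source, field, values = line.chomp.split("\t", 3)
    next if values.nil?

    case field
    when "kTraditionalVariant"
      # The source character has a traditional variant, so assume it is simplified.
      forms << char_from(source)
    when "kSimplifiedVariant"
      # Every code point listed as a value is a simplified form.
      values.scan(/U\+[0-9A-F]+/) { |cp| forms << char_from(cp) }
    end
  end
  forms
end

puts simplified_forms(VARIANT_SAMPLE).to_a.join(" ")
```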

    Another approach is to use the CC-CEDICT project, which publishes a dictionary of Chinese characters and compounds (both traditional and simplified). Each entry looks something like:

    宕機 宕机 [dang4 ji1] /to crash (of a computer)/Taiwanese term for 當機|当机[dang4 ji1]/

    The first column is traditional characters, and the second column is simplified.

    To get all the simplified characters, read this text file and make a list of every character that appears in the second column. Note that some characters may not appear by themselves (only in compounds), so it is not sufficient to look at single-character entries.
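
    The procedure described above could be sketched as follows. This assumes that each entry is laid out like the sample line and that comment lines start with `#`; `CEDICT_SAMPLE` and `simplified_column_chars` are names invented for the example. Note that the result is every character used in the simplified column, which includes characters shared with traditional text.

```ruby
require "set"

# The sample entry quoted above; a real run would read the dictionary file.
CEDICT_SAMPLE = "宕機 宕机 [dang4 ji1] /to crash (of a computer)/" \
                "Taiwanese term for 當機|当机[dang4 ji1]/"

# Collect every Han character that appears in the second (simplified) column.
def simplified_column_chars(text)
  chars = Set.new
  text.each_line do |line|
    next if line.start_with?("#")

    # The first two whitespace-separated fields are traditional and simplified.
    _traditional, simplified = line.split(" ", 3)
    next if simplified.nil?

    # Keep Han characters only; some headwords also contain letters or digits.
    simplified.scan(/\p{Han}/) { |ch| chars << ch }
  end
  chars
end

puts simplified_column_chars(CEDICT_SAMPLE).to_a.join(" ")
```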

  • 2020-12-05 13:07

    I'm not sure if that's easily done. The Han ideographs are unified in Unicode, so it's not immediately obvious how to do it. But the Unihan database (http://www.unicode.org/charts/unihan.html) might have the data you need.

  • 2020-12-05 13:10

    According to Wikipedia, whether a character is shown in its simplified Chinese, traditional Chinese, kanji, or other form is in many cases left to the font. So while you could put together a selection of simplified Chinese code points, the list would not be at all complete, since many characters are not distinct between the forms.

  • 2020-12-05 13:15

    The OP doesn't indicate which language they're using, but if you're using Ruby, I've written a small library that can distinguish between simplified and traditional Chinese (plus Korean and Japanese as a bonus). As suggested in Greg's answer, it relies on a distilled version of Unihan_Variants.txt to figure out which chars are exclusively simplified and which are exclusively traditional.

    https://github.com/jpatokal/script_detector

    Sample:

    > p string
    => "我的氣墊船充滿了鱔魚."
    > string.chinese?
    => true
    > string.traditional_chinese?
    => true
    > string.simplified_chinese?
    => false
    

    But as the Unicode FAQ duly warns, this requires sizable fragments of text to work reliably, and will give misleading results for short strings. Consider the Japanese for Tokyo:

    > p string
    => "東京"
    > string.chinese?
    => true
    > string.traditional_chinese?
    => true
    > string.japanese?
    => false
    

    Since both characters happen to also be valid traditional Chinese, and there are no exclusively Japanese characters, it's not recognized correctly.

  • Here is a regex of all simplified Chinese characters that I made. For some reason Stack Overflow complains when I paste it, so it's linked in a pastebin below.

    https://pastebin.com/xw4p7RVJ

    You'll notice that the list uses ranges rather than individual characters, and that the characters are literal UTF-8, not escaped representations. It has served me well in one iteration or another since around 2010. Hopefully everyone else can make some use of it now.

    If you don't want the simplified chars (I can't imagine why, it hasn't come up once in 9 years), iterate over all the chars in ['一-龥'] and build a new list of those that the regex does not match. Or run two regexes: one to check that the text is Chinese, and one to check that it is not simplified Chinese.
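
    Both suggestions might look roughly like this in Ruby. The pattern `SIMPLIFIED_ONLY` is a tiny hypothetical stand-in for the pastebin regex, which is far larger, and the method names are invented for the example.

```ruby
# Hypothetical stand-in for the pastebin regex; only five characters here.
SIMPLIFIED_ONLY = /[机气满鳝鱼]/

# The ['一-龥'] range from the text above: U+4E00 through U+9FA5.
CHINESE = /[一-龥]/

# Every character in the range that the simplified regex does not match.
def not_simplified_chars
  (0x4E00..0x9FA5)
    .map { |cp| [cp].pack("U") }
    .reject { |ch| ch.match?(SIMPLIFIED_ONLY) }
end

# Two-regex check: the text contains Chinese but no simplified-only character.
def chinese_but_not_simplified?(text)
  text.match?(CHINESE) && !text.match?(SIMPLIFIED_ONLY)
end

puts not_simplified_chars.size
puts chinese_but_not_simplified?("我的氣墊船充滿了鱔魚.")
puts chinese_but_not_simplified?("机")
```

    With the real regex in place of the stand-in, the same two methods should apply unchanged.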
