What are the upper and lower bounds for Chinese characters in UTF-8?

小蘑菇 2020-12-17 05:37

I would like to build a set in Python that contains the ord() values of all Chinese characters.

For English, the equivalent is:

english = set(range(or         


        
1 Answer
  • 2020-12-17 06:07

    From the Unicode Standard (v6.0, section 12.1),

    Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2.

    Table 12-2. Blocks Containing Han Ideographs
    
    Block                                   | Range       | Comment
    ----------------------------------------+-------------+-----------------------------------------------------
    CJK Unified Ideographs                  | 4E00–9FFF   | Common
    CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
    CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
    CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
    CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
    CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
    CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants
    

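    There are also a few small extensions to the original URO (4E00–9FA5). They fall inside the CJK Unified Ideographs block listed above, so the 4E00–9FFF range already covers them: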

    Table 12-3. Small Extensions to the URO
    
    Range     | Version | Comment
    ----------+---------+-------------------------------------------------
    9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
    9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
    9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
    9FC3      | 5.1     | Correction of mistaken unification
    9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
    9FC7–9FCB | 5.2     | Interoperability with HKSCS standard
    

    To construct a set of the ordinal values of these characters, you can do this:

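    from itertools import chain

    # range() objects cannot be concatenated with + in Python 3, so chain them.
    chinese = set(chain(
        range(0x4E00, 0xA000),    # CJK Unified Ideographs
        range(0x3400, 0x4DC0),    # Extension A
        range(0x20000, 0x2A6E0),  # Extension B
        range(0x2A700, 0x2B740),  # Extension C
        range(0x2B740, 0x2B820),  # Extension D
        range(0xF900, 0xFB00),    # CJK Compatibility Ideographs
        range(0x2F800, 0x2FA20),  # CJK Compatibility Ideographs Supplement
        range(0x9FA6, 0x9FCC),    # Table 12-3 (already inside 4E00-9FFF)
    ))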
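
    As a usage illustration, here is a minimal sketch that builds the same kind of set from the Table 12-2 ranges and uses it to pick the Han characters out of a mixed string. It assumes Python 3; the names `HAN_BLOCKS`, `chinese`, `text`, and `han_chars` are made up for this example.

```python
# Inclusive (first, last) code points of the seven blocks in Table 12-2.
HAN_BLOCKS = [
    (0x4E00, 0x9FFF),
    (0x3400, 0x4DBF),
    (0x20000, 0x2A6DF),
    (0x2A700, 0x2B73F),
    (0x2B740, 0x2B81F),
    (0xF900, 0xFAFF),
    (0x2F800, 0x2FA1F),
]

# One set entry per code point; "last + 1" because range() excludes its end.
chinese = {cp for first, last in HAN_BLOCKS for cp in range(first, last + 1)}

text = "Python 很好用"
han_chars = [ch for ch in text if ord(ch) in chinese]

print(han_chars)     # ['很', '好', '用']
print(len(chinese))  # 75744
```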
    

    Be aware, though, that this set contains 75,744 code points (some of them unassigned), so it may not be the most compact or efficient data structure for this.
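
    One more compact alternative, sketched here as an assumption rather than as part of the original answer, is to keep only the seven block boundaries and binary-search them. The names `HAN_RANGES`, `_STARTS`, and `is_han` are hypothetical, and the code assumes Python 3.

```python
from bisect import bisect_right

# Inclusive (start, end) code point ranges from Table 12-2, sorted by start.
HAN_RANGES = [
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # Extension B
    (0x2A700, 0x2B73F),  # Extension C
    (0x2B740, 0x2B81F),  # Extension D
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]
_STARTS = [start for start, _ in HAN_RANGES]


def is_han(ch):
    """Return True if the single character ch lies in one of HAN_RANGES."""
    cp = ord(ch)
    # Index of the last range whose start is at or before cp (-1 if none).
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= HAN_RANGES[i][1]


print(is_han("中"), is_han("a"))  # True False
```

    This stores 7 tuples instead of tens of thousands of integers, at the cost of a binary search per lookup instead of a hash lookup.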

    Also, if you want to call ord() on a literal character above U+FFFF, you need the \U escape, which takes exactly eight hex digits (\u takes only four):

    >>> ord(u'\U0002F800')
    194560
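
    To make the difference between the two escape forms concrete, here is a small sketch, assuming Python 3, where every string is Unicode and a character above U+FFFF is a single element. The variable names are illustrative only.

```python
# \u takes exactly four hex digits; \U takes exactly eight.
bmp = "\u4e2d"         # U+4E2D, in the CJK Unified Ideographs block
astral = "\U0002F800"  # U+2F800, in the Compatibility Ideographs Supplement

print(hex(ord(bmp)), hex(ord(astral)))  # 0x4e2d 0x2f800
print(chr(0x2F800) == astral)           # True
```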
    