How to find Chinese or Japanese characters in a string in Python?

北海茫月 2021-01-31 05:16

Such as:

str = 'sdf344asfasf天地方益3権sdfsdf'

Add parentheses around the Chinese and Japanese characters:

strAfterConvert = 'sdfasf


        
4 answers
  • 2021-01-31 05:43

    From a bleeding-edge branch of NLTK, inspired by the Moses machine translation toolkit:

    def is_cjk(character):
        """
        Checks whether character is CJK.
    
            >>> is_cjk(u'\u33fe')
            True
            >>> is_cjk(u'\uFE5F')
            False
    
        :param character: The character that needs to be checked.
        :type character: char
        :return: bool
        """
        return any([start <= ord(character) <= end for start, end in 
                    [(4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215), 
                     (63744, 64255), (65072, 65103), (65381, 65500), 
                     (131072, 196607)]
                    ])
    

    For the specifics of the ord() numbers:

    class CJKChars(object):
        """
        An object that enumerates the code points of the CJK characters as listed on
        http://en.wikipedia.org/wiki/Basic_Multilingual_Plane#Basic_Multilingual_Plane
    
        This is a Python port of the CJK code point enumerations of Moses tokenizer:
        https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl#L309
        """
        # Hangul Jamo (1100–11FF)
        Hangul_Jamo = (4352, 4607) # (ord(u"\u1100"), ord(u"\u11ff"))
    
        # CJK Radicals Supplement (2E80–2EFF)
        # Kangxi Radicals (2F00–2FDF)
        # Ideographic Description Characters (2FF0–2FFF)
        # CJK Symbols and Punctuation (3000–303F)
        # Hiragana (3040–309F)
        # Katakana (30A0–30FF)
        # Bopomofo (3100–312F)
        # Hangul Compatibility Jamo (3130–318F)
        # Kanbun (3190–319F)
        # Bopomofo Extended (31A0–31BF)
        # CJK Strokes (31C0–31EF)
        # Katakana Phonetic Extensions (31F0–31FF)
        # Enclosed CJK Letters and Months (3200–32FF)
        # CJK Compatibility (3300–33FF)
        # CJK Unified Ideographs Extension A (3400–4DBF)
        # Yijing Hexagram Symbols (4DC0–4DFF)
        # CJK Unified Ideographs (4E00–9FFF)
        # Yi Syllables (A000–A48F)
        # Yi Radicals (A490–A4CF)
        CJK_Radicals = (11904, 42191) # (ord(u"\u2e80"), ord(u"\ua4cf"))
    
        # Phags-pa (A840–A87F)
        Phags_Pa = (43072, 43135) # (ord(u"\ua840"), ord(u"\ua87f"))
    
        # Hangul Syllables (AC00–D7AF)
        Hangul_Syllables = (44032, 55215) # (ord(u"\uAC00"), ord(u"\uD7AF"))
    
        # CJK Compatibility Ideographs (F900–FAFF)
        CJK_Compatibility_Ideographs = (63744, 64255) # (ord(u"\uF900"), ord(u"\uFAFF"))
    
        # CJK Compatibility Forms (FE30–FE4F)
        CJK_Compatibility_Forms = (65072, 65103) # (ord(u"\uFE30"), ord(u"\uFE4F"))
    
        # Halfwidth forms of Katakana and Hangul characters (FF65–FFDC)
        Katakana_Hangul_Halfwidth = (65381, 65500) # (ord(u"\uFF65"), ord(u"\uFFDC"))
    
        # Supplementary Ideographic Plane 20000–2FFFF
        Supplementary_Ideographic_Plane = (131072, 196607) # (ord(u"\U00020000"), ord(u"\U0002FFFF"))
    
        ranges = [Hangul_Jamo, CJK_Radicals, Phags_Pa, Hangul_Syllables, 
                  CJK_Compatibility_Ideographs, CJK_Compatibility_Forms, 
                  Katakana_Hangul_Halfwidth, Supplementary_Ideographic_Plane]
    
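    The ranges above also cover scripts that are neither Chinese nor Japanese (Hangul, Yi, Phags-pa), so it can help to see which named range a character falls into before deciding what to keep. The following is only a sketch: the dictionary copies the numbers from the listing above, and `cjk_range_name` is a hypothetical helper, not part of NLTK.

```python
# Named code point ranges copied from the CJKChars listing above.
CJK_RANGES = {
    "Hangul_Jamo": (4352, 4607),
    "CJK_Radicals": (11904, 42191),
    "Phags_Pa": (43072, 43135),
    "Hangul_Syllables": (44032, 55215),
    "CJK_Compatibility_Ideographs": (63744, 64255),
    "CJK_Compatibility_Forms": (65072, 65103),
    "Katakana_Hangul_Halfwidth": (65381, 65500),
    "Supplementary_Ideographic_Plane": (131072, 196607),
}


def cjk_range_name(character):
    """Return the name of the range containing character, or None.

    Hypothetical helper for illustration only.
    """
    code_point = ord(character)
    for name, (start, end) in CJK_RANGES.items():
        if start <= code_point <= end:
            return name
    return None


print(cjk_range_name("天"))  # CJK_Radicals
print(cjk_range_name("한"))  # Hangul_Syllables
print(cjk_range_name("a"))   # None
```

    Dropping the Hangul and Phags-pa entries from the dictionary would be one way to narrow the check toward Chinese and Japanese text, although the broad `CJK_Radicals` span would still include Yi and Hangul Compatibility Jamo.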

    Combining the is_cjk() in this answer with @EvenLisle's substring answer:

    >>> from nltk.tokenize.util import is_cjk
    >>> text = u'sdf344asfasf天地方益3権sdfsdf'
    >>> [1 if is_cjk(ch) else 0 for ch in text]
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
    >>> def cjk_substrings(string):
    ...     i = 0
    ...     while i < len(string):
    ...         if is_cjk(string[i]):
    ...             start = i
    ...             # Bounds check avoids IndexError when the string ends with a CJK run.
    ...             while i < len(string) and is_cjk(string[i]): i += 1
    ...             yield string[start:i]
    ...         i += 1
    ... 
    >>> string = "sdf344asfasf天地方益3権sdfsdf"
    >>> for sub in cjk_substrings(string):
    ...     string = string.replace(sub, "(" + sub + ")")
    ... 
    >>> string
    'sdf344asfasf(天地方益)3(権)sdfsdf'
    >>> print(string)
    sdf344asfasf(天地方益)3(権)sdfsdf
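    One caveat with the replace() loop above: str.replace() rewrites every occurrence of a substring, so an input such as "天a天" yields the same run twice and ends up double-wrapped. A single-pass alternative can be sketched with itertools.groupby. The `wrap_cjk_runs` name is hypothetical, and `is_cjk` is redefined here with the same ranges only to keep the example self-contained.

```python
from itertools import groupby


def is_cjk(character):
    # Same ranges as the is_cjk() shown earlier in this answer.
    return any(start <= ord(character) <= end for start, end in [
        (4352, 4607), (11904, 42191), (43072, 43135), (44032, 55215),
        (63744, 64255), (65072, 65103), (65381, 65500), (131072, 196607)])


def wrap_cjk_runs(text):
    """Wrap each maximal run of CJK characters in parentheses.

    Hypothetical helper for illustration only.
    """
    pieces = []
    # groupby yields consecutive characters that share the same is_cjk() result.
    for run_is_cjk, run in groupby(text, key=is_cjk):
        chunk = "".join(run)
        pieces.append("(" + chunk + ")" if run_is_cjk else chunk)
    return "".join(pieces)


print(wrap_cjk_runs("sdf344asfasf天地方益3権sdfsdf"))  # sdf344asfasf(天地方益)3(権)sdfsdf
print(wrap_cjk_runs("天a天"))  # (天)a(天)
```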
    
  • 2021-01-31 05:47

    You can do the edit using the regex package, which supports checking the Unicode "Script" property of each character and is a drop-in replacement for the re package:

    import regex as re
    
    pattern = re.compile(r'([\p{IsHan}\p{IsBopo}\p{IsHira}\p{IsKatakana}]+)', re.UNICODE)
    
    text = u'sdf344asfasf天地方益3権sdfsdf'  # renamed from "input" to avoid shadowing the builtin
    output = pattern.sub(r'(\1)', text)
    print(output)  # Prints: sdf344asfasf(天地方益)3(権)sdfsdf
    

    You should adjust the \p{Is...} sequences to match the character scripts/blocks that you consider to be "Chinese or Japanese".

  • 2021-01-31 05:57

    If you can't use the regex module, which provides the IsKatakana and IsHan properties shown in @一二三's answer, you could use the character ranges from @EvenLisle's answer with the stdlib re module:

    >>> import re
    >>> print(re.sub(u"([\u3300-\u33ff\ufe30-\ufe4f\uf900-\ufaff\U0002f800-\U0002fa1f\u30a0-\u30ff\u2e80-\u2eff\u4e00-\u9fff\u3400-\u4dbf\U00020000-\U0002a6df\U0002a700-\U0002b73f\U0002b740-\U0002b81f\U0002b820-\U0002ceaf]+)", r"(\1)", u'sdf344asfasf天地方益3権sdfsdf'))
    sdf344asfasf(天地方益)3(権)sdfsdf
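    The long character class is easier to maintain when it is built from a list of ranges. The sketch below does that with the stdlib re module on Python 3. The selection of ranges (including Hiragana) and the names `RANGES`, `CJ_PATTERN` and `wrap_cj` are assumptions for illustration, so adjust them to whatever you treat as Chinese or Japanese.

```python
import re

# (start, end) code points; this selection is an assumption, not a standard list.
RANGES = [
    (0x2E80, 0x2EFF),    # CJK Radicals Supplement
    (0x3040, 0x309F),    # Hiragana
    (0x30A0, 0x30FF),    # Katakana
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
]

# Build "[<start>-<end>...]+" from the table; none of these characters
# is special inside a character class, so no escaping is needed here.
char_class = "".join("{}-{}".format(chr(start), chr(end)) for start, end in RANGES)
CJ_PATTERN = re.compile("([" + char_class + "]+)")


def wrap_cj(text):
    """Wrap each run of matching characters in parentheses (hypothetical helper)."""
    return CJ_PATTERN.sub(r"(\1)", text)


print(wrap_cj("sdf344asfasf天地方益3権sdfsdf"))  # sdf344asfasf(天地方益)3(権)sdfsdf
```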
    

    Beware of known issues.

    You could also check Unicode category:

    >>> import unicodedata
    >>> unicodedata.category(u'天')
    'Lo'
    >>> unicodedata.category(u's')
    'Ll'
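    Note that the category alone does not identify CJK text: 'Lo' ("Letter, other") is also the category of Hebrew, Arabic and many other letters. As a rough sketch, the character name from unicodedata.name() can narrow it down. The `script_hint` helper and its prefix list are hypothetical and deliberately simple.

```python
import unicodedata


def script_hint(character):
    """Rough script guess from the Unicode character name.

    Hypothetical helper; the prefix list is an assumption.
    """
    name = unicodedata.name(character, "")
    for prefix in ("CJK", "HIRAGANA", "KATAKANA"):
        if name.startswith(prefix):
            return prefix
    return None


hebrew_alef = "\u05d0"  # HEBREW LETTER ALEF

# All three characters share category 'Lo', but only two look CJK by name.
print(unicodedata.category("天"), script_hint("天"))                # Lo CJK
print(unicodedata.category(hebrew_alef), script_hint(hebrew_alef))  # Lo None
print(unicodedata.category("あ"), script_hint("あ"))                # Lo HIRAGANA
```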
    
  • 2021-01-31 06:02

    As a start, you can check whether the character is in one of the following Unicode blocks:

    • Unicode Block 'CJK Unified Ideographs' - U+4E00 to U+9FFF
    • Unicode Block 'CJK Unified Ideographs Extension A' - U+3400 to U+4DBF
    • Unicode Block 'CJK Unified Ideographs Extension B' - U+20000 to U+2A6DF
    • Unicode Block 'CJK Unified Ideographs Extension C' - U+2A700 to U+2B73F
    • Unicode Block 'CJK Unified Ideographs Extension D' - U+2B740 to U+2B81F

    After that, all you need to do is iterate through the string, check whether each character is Chinese, Japanese or Korean (CJK), and append accordingly:

    # -*- coding: utf-8 -*-
    ranges = [
      {"from": ord(u"\u3300"), "to": ord(u"\u33ff")},         # CJK compatibility
      {"from": ord(u"\ufe30"), "to": ord(u"\ufe4f")},         # CJK compatibility forms
      {"from": ord(u"\uf900"), "to": ord(u"\ufaff")},         # CJK compatibility ideographs
      {"from": ord(u"\U0002F800"), "to": ord(u"\U0002fa1f")}, # CJK compatibility ideographs supplement
      {"from": ord(u"\u3040"), "to": ord(u"\u309f")},         # Japanese Hiragana
      {"from": ord(u"\u30a0"), "to": ord(u"\u30ff")},         # Japanese Katakana
      {"from": ord(u"\u2e80"), "to": ord(u"\u2eff")},         # CJK radicals supplement
      {"from": ord(u"\u4e00"), "to": ord(u"\u9fff")},         # CJK unified ideographs
      {"from": ord(u"\u3400"), "to": ord(u"\u4dbf")},         # extension A
      {"from": ord(u"\U00020000"), "to": ord(u"\U0002a6df")}, # extension B
      {"from": ord(u"\U0002a700"), "to": ord(u"\U0002b73f")}, # extension C
      {"from": ord(u"\U0002b740"), "to": ord(u"\U0002b81f")}, # extension D
      {"from": ord(u"\U0002b820"), "to": ord(u"\U0002ceaf")}  # extension E, included as of Unicode 8.0
    ]
    
    def is_cjk(char):
      # "r" instead of "range" so the builtin is not shadowed
      return any(r["from"] <= ord(char) <= r["to"] for r in ranges)
    
    def cjk_substrings(string):
      i = 0
      while i < len(string):
        if is_cjk(string[i]):
          start = i
          # Bounds check avoids IndexError when the string ends with a CJK run.
          while i < len(string) and is_cjk(string[i]): i += 1
          yield string[start:i]
        i += 1
    
    string = "sdf344asfasf天地方益3権sdfsdf"
    for sub in cjk_substrings(string):
      string = string.replace(sub, "(" + sub + ")")
    print(string)
    

    The above prints

    sdf344asfasf(天地方益)3(権)sdfsdf
    

    To be future-proof, keep an eye on new extension blocks. CJK Unified Ideographs Extension E (U+2B820 to U+2CEAF) was added in Unicode 8.0, released in June 2015, and is included in the ranges above. Later Unicode versions add further extension blocks (such as Extension F in Unicode 10.0), so revisit the ranges when you upgrade.
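    Whether a given Python build knows about a newer block can be checked against the Unicode database it ships with. This is a minimal sketch using the stdlib unicodedata module. The `is_assigned` helper is hypothetical and treats any code point whose general category is not 'Cn' (unassigned) as known.

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
print(unicodedata.unidata_version)


def is_assigned(code_point):
    """True if this Python build assigns a category other than 'Cn'.

    Hypothetical helper for illustration only.
    """
    return unicodedata.category(chr(code_point)) != "Cn"


# First code point of Extension E; assigned from Unicode 8.0 onward.
print(is_assigned(0x2B820))
```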

    [EDIT]

    Added CJK compatibility ideographs, Japanese Kana and CJK radicals.
