Python: any way to perform this “hybrid” split() on multi-lingual (e.g. Chinese & English) strings?

后悔当初 2021-01-31 23:06

I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean

5 answers
  • 2021-01-31 23:20

    The following works for Python 3.7:

    import re
    def group_words(s):
        return re.findall(u'[\u4e00-\u9fff]|[a-zA-Z0-9]+', s)
    
    
    if __name__ == "__main__":
        print(group_words(u"Testing English text"))
        print(group_words(u"我爱蟒蛇"))
        print(group_words(u"Testing English text我爱蟒蛇"))
    
    ['Testing', 'English', 'text']
    ['我', '爱', '蟒', '蛇']
    ['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
    

    For some reason, I cannot adapt Glenn Maynard's answer to Python 3.
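
One possible explanation, offered as a guess rather than a diagnosis: in Python 3, `\w` in a `str` pattern matches Unicode word characters, CJK included, so a `\w+` alternative swallows the Chinese text before the single-character alternative gets a chance. The sketch below uses a made-up sample string and shows the difference the `re.ASCII` flag makes.

```python
import re

text = "text我爱蟒蛇"  # hypothetical sample input

# Default Python 3 behaviour: \w is Unicode-aware, so the whole run is one "word".
unicode_tokens = re.findall(r"\w+|[\u4e00-\u9fff]", text)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so each CJK character
# falls through to the second alternative and is matched on its own.
ascii_tokens = re.findall(r"\w+|[\u4e00-\u9fff]", text, re.ASCII)

print(unicode_tokens)
print(ascii_tokens)
```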

  • 2021-01-31 23:26

    In Python 3, this version also splits out numbers, if you need that.

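    import re
    
    def spliteKeyWord(text):
        # One CJK character | a run of digits | letters with an optional apostrophe suffix.
        regex = r"[\u4e00-\ufaff]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
        # str patterns are Unicode-aware by default in Python 3, so re.UNICODE is not needed.
        matches = re.findall(regex, text)
        return matches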
    
    print(spliteKeyWord("Testing English text我爱Python123"))
    

    => ['Testing', 'English', 'text', '我', '爱', 'Python', '123']

  • 2021-01-31 23:33

    Formatting a list shows the repr of its components. If you want to view the strings naturally rather than escaped, you'll need to format the output yourself. (repr should not be escaping these characters; repr(u'我') should return "u'我'", not "u'\u6211'". Apparently Python 3 does return the unescaped form; only 2.x is stuck with the English-centric escaping for Unicode strings.)
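
As a minimal sketch of "format it yourself" (Python 3, with a made-up `words` list): joining the items prints the characters as they are, while the built-in `ascii()` reproduces the escaped form that Python 2's `repr` shows.

```python
words = ["Testing", "我", "爱"]  # hypothetical sample tokens

# Join the strings yourself instead of relying on the list's repr.
natural = " | ".join(words)
print(natural)

# ascii() escapes every non-ASCII character, much like repr() did on Python 2.
escaped = ascii(words)
print(escaped)
```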

    A basic algorithm you can use is assigning a character class to each character, then grouping letters by class. Starter code is below.

    I didn't use a doctest for this because I hit some odd encoding issues that I don't want to look into (out of scope). You'll need to implement a correct grouping function.

    Note that if you're using this for word wrapping, there are other per-language considerations. For example, you don't want to break on non-breaking spaces; you do want to break on hyphens; for Japanese you don't want to split apart きゅ; and so on.

    # -*- coding: utf-8 -*-
    import itertools, unicodedata
    
    def group_words(s):
        # This is a closure for key(), encapsulated in an array to work around
        # 2.x's lack of the nonlocal keyword.
        sequence = [0x10000000]
    
        def key(part):
            if part.isspace():
                return 0
    
            # This is incorrect, but serves this example; finding a more
            # accurate categorization of characters is up to the user.
            asian = unicodedata.category(part) == "Lo"
            if asian:
                # Never group asian characters, by returning a unique value for each one.
                sequence[0] += 1
                return sequence[0]
    
            return 2
    
        result = []
        # Distinct names here avoid shadowing key() and the built-in str.
        for group_key, group in itertools.groupby(s, key):
            # Discard groups of whitespace.
            if group_key == 0:
                continue
    
            result.append("".join(group))
    
        return result
    
    if __name__ == "__main__":
        # print() with a single argument behaves the same on Python 2 and 3.
        print(group_words(u"Testing English text"))
        print(group_words(u"我爱蟒蛇"))
        print(group_words(u"Testing English text我爱蟒蛇"))
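
The code comments above say the `"Lo"` test is only a stand-in: it also fires for Hebrew or Arabic letters, which do use spaces. One possible refinement, sketched here for Python 3 under the assumption that explicit code point ranges are good enough, is to treat only ideographs and kana as single-character tokens. The names `UNSPACED_RANGES`, `is_unspaced` and `group_words_by_range` are made up for this illustration, and the range list is not exhaustive.

```python
import itertools

# Illustrative, non-exhaustive Unicode blocks for scripts written without spaces.
UNSPACED_RANGES = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
]

def is_unspaced(ch):
    """Return True if ch falls in one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in UNSPACED_RANGES)

def group_words_by_range(s):
    counter = itertools.count(1)

    def classify(ch):
        if ch.isspace():
            return ("space", 0)
        if is_unspaced(ch):
            # A fresh key per character, so these are never grouped together.
            return ("single", next(counter))
        return ("word", 0)

    return ["".join(group)
            for key, group in itertools.groupby(s, classify)
            if key[0] != "space"]

print(group_words_by_range("Testing English text我爱蟒蛇"))
```

This still splits every kana character on its own, so it does not address the word-wrapping caveats mentioned earlier.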
    
  • 2021-01-31 23:36

    I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddities I've seen make me worry that a regular expression might not be flexible enough for all of them. You may well not need any of that, though. (In other words: overdesign.)

    # -*- coding: utf-8 -*-
    import re
    def group_words(s):
        regex = []
    
        # Match a whole word:
        regex += [r'\w+']
    
        # Match a single CJK character:
        regex += [r'[\u4e00-\ufaff]']
    
        # Match one of anything else, except for spaces:
        regex += [r'[^\s]']
    
        regex = "|".join(regex)
        # Python 3 has no ur'' prefix, and its \w matches CJK characters by
        # default; re.ASCII keeps \w ASCII-only, which this pattern relies on.
        r = re.compile(regex, re.ASCII)
    
        return r.findall(s)
    
    if __name__ == "__main__":
        print(group_words(u"Testing English text"))
        print(group_words(u"我爱蟒蛇"))
        print(group_words(u"Testing English text我爱蟒蛇"))
    

    In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.
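
A minimal sketch of the compile-once variant, written for Python 3. The names `_TOKEN_RE` and `group_words_cached` are made up for this example, and `re.ASCII` is an assumption added so that `\w` stays ASCII-only.

```python
import re

# Compiled a single time when the module is imported.
# re.ASCII keeps \w and \s ASCII-only so that \w+ does not swallow CJK text.
_TOKEN_RE = re.compile(r"\w+|[\u4e00-\ufaff]|[^\s]", re.ASCII)

def group_words_cached(s):
    """Tokenize s with the precompiled pattern."""
    return _TOKEN_RE.findall(s)

print(group_words_cached("Testing English text我爱蟒蛇"))
```

The `re` module also keeps its own cache of recently compiled patterns, so the gain is likely modest unless the function is called in a tight loop.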

  • 2021-01-31 23:39

    I modified Glenn's solution to drop symbols and to work with other alphabets, such as Russian and French:

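    import re
    
    def rec_group_words():
        regex = []
    
        # Match a whole word: ASCII letters and digits, Latin-1 letters
        # (\xc0-\xff, e.g. French) and basic Cyrillic (\u0400-\u04ff, e.g. Russian).
        # A-Za-z rather than A-z, which would also match [ \ ] ^ _ and `.
        regex += [r'[A-Za-z0-9\xc0-\xff\u0400-\u04ff]+']
    
        # Match a single CJK character:
        regex += [r'[\u4e00-\ufaff]']
    
        regex = "|".join(regex)
        return re.compile(regex)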
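
The answer gives no usage example, so here is a rough sketch for Python 3. The pattern `WORD_OR_CJK` and the sample sentence are assumptions made for this illustration: the character class is spelled `A-Za-z` and includes a basic Cyrillic range (`\u0400-\u04ff`) so that Russian words are kept whole.

```python
import re

# Assumed variant of the pattern: ASCII letters/digits, Latin-1 letters,
# basic Cyrillic, or a single CJK character. Everything else is dropped.
WORD_OR_CJK = re.compile(r"[A-Za-z0-9\xc0-\xff\u0400-\u04ff]+|[\u4e00-\ufaff]")

tokens = WORD_OR_CJK.findall("Привет, caf\xe9! 我爱Python3")
print(tokens)
```

Punctuation such as the comma and the exclamation mark matches neither alternative, so `findall` skips it.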
    