Python: any way to perform this “hybrid” split() on multi-lingual (e.g. Chinese & English) strings?

后悔当初 2021-01-31 23:06

I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean).

5 answers
  •  迷失自我
    2021-01-31 23:20

    The following works for Python 3.7:

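    import re

    def group_words(s):
        # Each CJK ideograph (U+4E00-U+9FFF) becomes its own token;
        # each run of ASCII letters/digits becomes one token.
        # Whitespace and all other characters are skipped.
        return re.findall('[\u4e00-\u9fff]|[a-zA-Z0-9]+', s)


    if __name__ == "__main__":
        print(group_words("Testing English text"))
        print(group_words("我爱蟒蛇"))
        print(group_words("Testing English text我爱蟒蛇"))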
    
    ['Testing', 'English', 'text']
    ['我', '爱', '蟒', '蛇']
    ['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
    

    For some reason, I cannot adapt Glenn Maynard's answer to Python 3.
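The pattern above only covers CJK ideographs and ASCII letters and digits, so accented letters (as in French), Japanese kana, and Korean Hangul would be dropped. Here is a rough sketch of one way to widen it. The function name `group_words_extended` and the extra ranges (Hiragana/Katakana at U+3040-U+30FF, Hangul syllables at U+AC00-U+D7A3) are my own assumptions, not part of the answer above, and they are not an exhaustive list of CJK blocks.

```python
import re

# Characters treated as single-character tokens (assumed ranges, adjust as needed):
#   \u3040-\u30ff  Hiragana and Katakana
#   \u4e00-\u9fff  CJK Unified Ideographs
#   \uac00-\ud7a3  Hangul syllables
CJK_RANGES = r'\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3'

# Either one character from the ranges above, or a run of word characters
# (\w, which is Unicode-aware for str patterns) that are NOT in those ranges.
# [^\W...] reads as "a \w character, excluding the listed ranges".
TOKEN_RE = re.compile(r'[' + CJK_RANGES + r']|[^\W' + CJK_RANGES + r']+')


def group_words_extended(s):
    """Split s into whitespace-style words and individual CJK characters."""
    return TOKEN_RE.findall(s)


if __name__ == "__main__":
    print(group_words_extended("Testing English text我爱蟒蛇"))
    # ['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
    print(group_words_extended("caf\u00e9 au lait"))
    print(group_words_extended("こんにちは world"))
```

Note that `\w` also matches the underscore, so `snake_case` stays one token; punctuation and whitespace are still skipped, as in the original.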
