I have multi-lingual strings consisting of both languages that use whitespace as a word separator (English, French, etc.) and languages that don't (Chinese, Japanese, Korean).
The following works for Python 3.7:
import re

def group_words(s):
    # Match each CJK ideograph (U+4E00-U+9FFF) as its own token,
    # or a whole run of ASCII letters/digits as one token.
    return re.findall(u'[\u4e00-\u9fff]|[a-zA-Z0-9]+', s)

if __name__ == "__main__":
    print(group_words(u"Testing English text"))
    print(group_words(u"我爱蟒蛇"))
    print(group_words(u"Testing English text我爱蟒蛇"))
Output:

['Testing', 'English', 'text']
['我', '爱', '蟒', '蛇']
['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
For some reason, I cannot adapt Glenn Maynard's answer to Python 3.
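Note that [\u4e00-\u9fff] only covers the CJK Unified Ideographs block, so hiragana, katakana, and Hangul are silently dropped even though the input may contain Japanese and Korean. Here is a minimal sketch of a broader pattern, assuming the kana and Hangul-syllable ranges below are the ones you need (they are illustrative, not exhaustive; group_words_cjk is just a name I made up):

import re

def group_words_cjk(s):
    # Assumed ranges: \u3040-\u30ff (hiragana + katakana),
    # \uac00-\ud7a3 (precomposed Hangul syllables).
    return re.findall(
        r'[\u4e00-\u9fff]'      # one Han character at a time
        r'|[\u3040-\u30ff]+'    # a run of kana as one token
        r'|[\uac00-\ud7a3]+'    # a run of Hangul as one token
        r'|[a-zA-Z0-9]+',       # a run of ASCII letters/digits
        s,
    )

print(group_words_cjk("Testing 한국어 テスト 我爱蟒蛇"))
# ['Testing', '한국어', 'テスト', '我', '爱', '蟒', '蛇']

Han characters are still split one by one (there is no cheap way to find word boundaries in Chinese with a regex alone), while kana and Hangul runs are kept whole, matching how the original pattern treats Latin runs.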