I have multi-lingual strings consisting of both languages that use whitespace as a word separator (English, French, etc.) and languages that don't (Chinese, Japanese, Korean).
The following works for Python 3.7:
import re

def group_words(s):
    # Match each CJK ideograph (U+4E00-U+9FFF) as its own token,
    # or a whole run of ASCII letters/digits as one token.
    return re.findall(u'[\u4e00-\u9fff]|[a-zA-Z0-9]+', s)

if __name__ == "__main__":
    print(group_words(u"Testing English text"))
    print(group_words(u"我爱蟒蛇"))
    print(group_words(u"Testing English text我爱蟒蛇"))
Output:

['Testing', 'English', 'text']
['我', '爱', '蟒', '蛇']
['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
For some reason, I cannot adapt Glenn Maynard's answer to Python 3.
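Note that [\u4e00-\u9fff] only covers the CJK Unified Ideographs block, so hiragana, katakana, and Hangul are silently dropped even though the input may contain Japanese and Korean. Here is a minimal sketch of a broader pattern, assuming the kana and Hangul-syllable ranges below are the ones you need (they are illustrative, not exhaustive; group_words_cjk is just a name I made up):

import re

def group_words_cjk(s):
    # Assumed ranges: \u3040-\u30ff (hiragana + katakana),
    # \uac00-\ud7a3 (precomposed Hangul syllables).
    return re.findall(
        r'[\u4e00-\u9fff]'      # one Han character at a time
        r'|[\u3040-\u30ff]+'    # a run of kana as one token
        r'|[\uac00-\ud7a3]+'    # a run of Hangul as one token
        r'|[a-zA-Z0-9]+',       # a run of ASCII letters/digits
        s,
    )

print(group_words_cjk("Testing 한국어 テスト 我爱蟒蛇"))
# ['Testing', '한국어', 'テスト', '我', '爱', '蟒', '蛇']

Han characters are still split one by one (there is no cheap way to find word boundaries in Chinese with a regex alone), while kana and Hangul runs are kept whole, matching how the original pattern treats Latin runs.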