I have strings that are multi-lingual consist of both languages that use whitespace as word separator (English, French, etc) and languages that don\'t (Chinese, Japanese, Korean
I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddnesses I've seen makes me worried that a regular expression might not be flexible enough for all of them--but you may well not need any of that. (In other words--overdesign.)
# -*- coding: utf-8 -*-
import re
def group_words(s):
regex = []
# Match a whole word:
regex += [ur'\w+']
# Match a single CJK character:
regex += [ur'[\u4e00-\ufaff]']
# Match one of anything else, except for spaces:
regex += [ur'[^\s]']
regex = "|".join(regex)
r = re.compile(regex)
return r.findall(s)
if __name__ == "__main__":
print group_words(u"Testing English text")
print group_words(u"我爱蟒蛇")
print group_words(u"Testing English text我爱蟒蛇")
In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.