Python: any way to perform this “hybrid” split() on multi-lingual (e.g. Chinese & English) strings?

后端未结

关注

 5  1622

后悔当初 2021-01-31 23:06

I have strings that are multi-lingual consist of both languages that use whitespace as word separator (English, French, etc) and languages that don\'t (Chinese, Japanese, Korean

5条回答

余生分开走 (楼主)

2021-01-31 23:36
I thought I'd show the regex approach, too. It doesn't feel right to me, but that's mostly because all of the language-specific i18n oddnesses I've seen makes me worried that a regular expression might not be flexible enough for all of them--but you may well not need any of that. (In other words--overdesign.)
```
# -*- coding: utf-8 -*-
import re
def group_words(s):
    regex = []

    # Match a whole word:
    regex += [ur'\w+']

    # Match a single CJK character:
    regex += [ur'[\u4e00-\ufaff]']

    # Match one of anything else, except for spaces:
    regex += [ur'[^\s]']

    regex = "|".join(regex)
    r = re.compile(regex)

    return r.findall(s)

if __name__ == "__main__":
    print group_words(u"Testing English text")
    print group_words(u"我爱蟒蛇")
    print group_words(u"Testing English text我爱蟒蛇")
```
In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.
0 讨论(0)

查看其它5个回答
发布评论:

提交评论
- 加载中...