Python: any way to perform this “hybrid” split() on multi-lingual (e.g. Chinese & English) strings?

后悔当初 2021-01-31 23:06

I have multilingual strings that consist of both languages that use whitespace as a word separator (English, French, etc.) and languages that don't (Chinese, Japanese, Korean

5 Answers
  •  余生分开走
    2021-01-31 23:36

    I thought I'd show the regex approach, too. It doesn't feel right to me, mostly because the language-specific i18n oddities I've seen make me worry that a regular expression might not be flexible enough for all of them. But you may well not need any of that. (In other words: overdesign.)

    import re

    def group_words(s):
        regex = []

        # Match a whole word. The re.ASCII flag below keeps \w limited to
        # [a-zA-Z0-9_], so a run of CJK characters is not taken as one word:
        regex += [r'\w+']

        # Match a single CJK character:
        regex += [r'[\u4e00-\ufaff]']

        # Match one of anything else, except for spaces:
        regex += [r'[^\s]']

        regex = "|".join(regex)
        r = re.compile(regex, re.ASCII)

        return r.findall(s)

    if __name__ == "__main__":
        print(group_words("Testing English text"))
        print(group_words("我爱蟒蛇"))
        print(group_words("Testing English text我爱蟒蛇"))
    

    In practice, you'd probably want to only compile the regex once, not on each call. Again, filling in the particulars of character grouping is up to you.
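
    The compile-once suggestion could look roughly like the sketch below. It assumes Python 3; the module-level name `_TOKEN_RE` is only illustrative, and the `re.ASCII` flag is an assumption made here so that `\w+` matches ASCII word characters only and leaves CJK characters to the second alternative.

```python
import re

# Compiled once at import time instead of on every call.
# re.ASCII (an assumption of this sketch) restricts \w and \s to ASCII,
# so CJK characters fall through to the single-character alternative.
_TOKEN_RE = re.compile(
    r"\w+"                # a whole ASCII word
    r"|[\u4e00-\ufaff]"   # a single character in the CJK range used above
    r"|[^\s]",            # any other single non-space character
    re.ASCII,
)


def group_words(s):
    """Split s into ASCII words, single CJK characters, and other symbols."""
    return _TOKEN_RE.findall(s)


if __name__ == "__main__":
    print(group_words("Testing English text我爱蟒蛇"))
    # ['Testing', 'English', 'text', '我', '爱', '蟒', '蛇']
```

    The character range is still the rough one from the snippet above, so it would need adjusting for scripts it does not cover.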
