Python regex: tokenizing English contractions

天涯浪人 2021-01-20 21:59

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example the tokenization of "shouldn't" wou

5 answers
  •  一向 (original poster)
    2021-01-20 22:18

    You can join the following patterns into a single regex and split on it:

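    import re

    # Each contraction suffix is in a capture group so re.split() keeps it as a token;
    # whitespace (\s) is not captured, so it is dropped from the result.
    patterns_list = [r"\s", r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)"]
    pattern = re.compile("|".join(patterns_list))
    s = "I wouldn't've done that."

    # re.split() emits None for groups that did not match and '' between
    # adjacent matches; the filter removes both.
    print([i for i in pattern.split(s) if i])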
    

    Result:

    ['I', 'would', "n't", "'ve", 'done', 'that.']
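
If you need this in more than one place, the same split can be wrapped in a small function. This is only a sketch of the approach above: the name `tokenize` and the constant `SUFFIX_RE` are made up for illustration, and the suffix list is just the one from the answer, folded into a single capture group.

```python
import re

# Hypothetical helper, not from any library. One capture group holds all
# contraction suffixes, so re.split() returns them as separate tokens.
# Whitespace runs are matched outside the group and therefore discarded.
SUFFIX_RE = re.compile(r"\s+|(n't|'m|'ll|'ve|'s|'re|'d)")


def tokenize(text):
    """Split text on whitespace and peel off the listed contraction suffixes."""
    # The filter drops None (whitespace matches leave the group unset)
    # and the empty strings that appear between adjacent matches.
    return [tok for tok in SUFFIX_RE.split(text) if tok]


print(tokenize("I wouldn't've done that."))
# ['I', 'would', "n't", "'ve", 'done', 'that.']
print(tokenize("She's sure they'll say I'm right."))
# ['She', "'s", 'sure', 'they', "'ll", 'say', 'I', "'m", 'right.']
```

Note that this stays purely pattern-based: "can't" comes out as "ca" + "n't", and punctuation such as the final period stays attached to the preceding word, exactly as in the result above.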
    
