Python regex: tokenizing English contractions

天涯浪人 2021-01-20 21:59

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" wou

5 answers
  •  时光取名叫无心
    2021-01-20 22:30

    Here is a simple one:

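    # Pad with spaces so every contraction, even the last word, is followed by a space.
    text = ' ' + text.lower() + ' '
    # Irregular forms must be replaced before the generic "n't" rule,
    # otherwise "won't" becomes "wo not" and "can't" becomes "ca not".
    text = text.replace(" won't ", ' will not ').replace(" can't ", ' can not ') \
        .replace("n't ", ' not ') \
        .replace("'s ", ' is ').replace("'m ", ' am ') \
        .replace("'ll ", ' will ').replace("'d ", ' would ') \
        .replace("'re ", ' are ').replace("'ve ", ' have ')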
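Since the question asks about regex, the same idea could be sketched with the `re` module. This is only one possible approach, not a definitive implementation: the helper names `expand_contractions` and `tokenize`, and the small table of irregular forms, are assumptions made for illustration. It uses `\b` word boundaries instead of space padding, so a contraction followed by punctuation is also caught.

```python
import re

# Suffix expansions, mirroring the replace chain above. "'s" and "'d" are
# ambiguous ("it's" may mean "it has", "he'd" may mean "he had"), so these
# are defaults only.
SUFFIXES = {
    "n't": " not",
    "'s": " is",
    "'m": " am",
    "'ll": " will",
    "'d": " would",
    "'re": " are",
    "'ve": " have",
}

# Irregular whole words that the generic "n't" rule would mangle
# (an illustrative, incomplete table).
IRREGULAR = {"won't": "will not", "can't": "can not", "shan't": "shall not"}

# The irregular words start earlier in the string than their "n't" suffix,
# so the leftmost-match rule lets them win over the suffix alternative.
PATTERN = re.compile(
    r"\b(?:won't|can't|shan't)\b|(?:n't|'(?:s|m|ll|d|re|ve))\b"
)

def expand_contractions(text):
    """Lowercase text and expand common English contractions."""
    def repl(match):
        token = match.group(0)
        return IRREGULAR.get(token) or SUFFIXES[token]
    return PATTERN.sub(repl, text.lower())

def tokenize(text):
    """Split the expanded text on whitespace."""
    return expand_contractions(text).split()

print(tokenize("I shouldn't go"))
```

Like the replace chain, this expands possessives such as "John's" to "john is", so it is only suitable when that ambiguity is acceptable.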
    
