Python regex: tokenizing English contractions

前端未结

关注

 5  1514

天涯浪人 2021-01-20 21:59

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example the tokenization of \"shouldn\'t\" wou

5条回答

时光取名叫无心 (楼主)

2021-01-20 22:30

Here a simple one

text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
    .replace("'s ", ' is ').replace("'m ", ' am ') \
    .replace("'ll ", ' will ').replace("'d ", ' would ') \
    .replace("'re ", ' are ').replace("'ve ", ' have ')

0 讨论(0)

查看其它5个回答