Python regex: tokenizing English contractions

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example the tokenization of "shouldn't" wou

5 Answers

一向

2021-01-20 22:18

You can split on the following regexes:

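import re

# Capturing groups make re.split keep the matched suffix; \s has no group, so whitespace is dropped.
# 'm is captured too, otherwise the "'m" of "I'm" would be discarded.
patterns_list = [r"\s", r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)"]
pattern = re.compile('|'.join(patterns_list))
s = "I wouldn't've done that."

# Drop the None and empty-string entries that re.split produces.
print([i for i in pattern.split(s) if i])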

Result:

['I', 'would', "n't", "'ve", 'done', 'that.']
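
The suffixes survive the split while the whitespace does not because `re.split` includes text matched by capturing groups in its result and discards everything else the pattern matched. A small illustration (the variable names are only for this example):

```python
import re

# Without a capturing group the matched suffix is thrown away.
dropped = re.split(r"'m", "I'm")
print(dropped)  # ['I', '']

# With a capturing group the suffix is kept as its own list item.
kept = re.split(r"('m)", "I'm")
print(kept)  # ['I', "'m", '']
```

The trailing empty string is why the list comprehension filters on `if i`.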


感情败类

2021-01-20 22:30
```
(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])
```
EDIT: \2 is the match, \3 is the first group, \4 the second and \5 the third.
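
A rough sketch of how those groups might be used from Python. `split_contraction` is a hypothetical helper name, not something from the answer, and it assumes it is handed one word at a time. Note that the pattern only matches words that actually contain one of the listed contractions, so anything else is returned unchanged.

```python
import re

# The pattern from the answer, unchanged.
CONTRACTION = re.compile(
    r"""(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])"""
)

def split_contraction(word):
    """Hypothetical helper: split one word into its stem and contraction suffixes."""
    m = CONTRACTION.search(word)
    if m is None:
        return [word]  # no listed contraction found
    # Groups 3, 4 and 5 hold the suffixes; unused groups are None.
    suffixes = [g for g in m.group(3, 4, 5) if g]
    whole = m.group(2)
    stem = whole[: len(whole) - sum(len(s) for s in suffixes)]
    return [stem] + suffixes

print(split_contraction("wouldn't've"))  # ['would', "n't", "'ve"]
print(split_contraction("she's"))        # ['she', "'s"]
print(split_contraction("done"))         # ['done']
```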

时光取名叫无心

2021-01-20 22:30

Here is a simple one:

text = ' ' + text.lower() + ' '
text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
    .replace("'s ", ' is ').replace("'m ", ' am ') \
    .replace("'ll ", ' will ').replace("'d ", ' would ') \
    .replace("'re ", ' are ').replace("'ve ", ' have ')
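
For anyone who wants to try it, here is the same chain wrapped in a function. The name `expand_contractions` and the final `strip()` are my additions, not part of the answer. Keep in mind that this expands contractions rather than tokenizing them, and that plain string replacement cannot tell a possessive `'s` from "is", nor "'d" as "had" from "would"; "can't" also comes out as "ca not".

```python
def expand_contractions(text):
    """Hypothetical wrapper around the replace chain above."""
    # Pad with spaces so suffixes at the end of the string are followed by a space too.
    text = ' ' + text.lower() + ' '
    text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
        .replace("'s ", ' is ').replace("'m ", ' am ') \
        .replace("'ll ", ' will ').replace("'d ", ' would ') \
        .replace("'re ", ' are ').replace("'ve ", ' have ')
    # Remove the padding again (an addition for this example).
    return text.strip()

print(expand_contractions("I won't say she's late"))  # i will not say she is late
print(expand_contractions("John's book"))             # john is book  (possessive misread)
```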


长发绾君心

2021-01-20 22:36

>>> import nltk
>>> nltk.word_tokenize("I wouldn't've done that.")
['I', "wouldn't", "'ve", 'done', 'that', '.']

So, running the tokenizer a second time on each token splits the remaining contraction:

>>> from itertools import chain
>>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
[['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
>>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
['I', 'would', "n't", "'ve", 'done', 'that', '.']


北恋

2021-01-20 22:39

You can use this regex to tokenize the text:

(?:(?!.')\w)+|\w?'\w+|[^\s\w]

Usage:

>>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
['I', 'would', "n't", "'ve", 'done', 'that', '.']
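
One thing to watch: the `(?!.')` lookahead always leaves exactly one word character in front of the apostrophe, which is right for `n't` but cuts other contractions one letter too early. The sketch below shows this, together with one possible variant that splits before `n't` or before any apostrophe suffix. `VARIANT` and `tokenize` are hypothetical names and the variant is only an assumption about the behaviour you want, not the answer's regex.

```python
import re

# Regex from the answer above.
ORIGINAL = re.compile(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]")

# Hypothetical variant: stop a word lazily right before "n't" or before
# an apostrophe followed by letters, however long the stem is.
VARIANT = re.compile(r"n't|'\w+|\w+?(?=n't|'\w)|\w+|[^\s\w]")

def tokenize(text, pattern=VARIANT):
    return pattern.findall(text)

print(tokenize("She's here.", ORIGINAL))      # ['Sh', "e's", 'here', '.']
print(tokenize("She's here."))                # ['She', "'s", 'here', '.']
print(tokenize("I wouldn't've done that."))   # ['I', 'would', "n't", "'ve", 'done', 'that', '.']
```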
