Python regex: tokenizing English contractions

天涯浪人 2021-01-20 21:59

I am trying to parse strings in such a way as to separate out all word components, even those that have been contracted. For example, the tokenization of "shouldn't" wou

5 answers
  • 2021-01-20 22:18

    You can join the following regexes into a single pattern and split on it:

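    import re

    # Each contraction suffix is captured so that re.split() keeps it in the output;
    # whitespace is left uncaptured so it is dropped.
    patterns_list = [r'\s', r"(n't)", r"('m)", r"('ll)", r"('ve)", r"('s)", r"('re)", r"('d)"]
    pattern = re.compile('|'.join(patterns_list))
    s = "I wouldn't've done that."

    # Unmatched groups come back as None and adjacent matches yield '', so filter both out.
    print([i for i in pattern.split(s) if i])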
    

    Result:

    ['I', 'would', "n't", "'ve", 'done', 'that.']
    
  • 2021-01-20 22:30
    (?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])
    

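    EDIT: \1 is the optional opening quote, \2 is the whole word, \3 is the first suffix ('d, 'll or n't), \4 is the optional 've that follows it, and \5 is a standalone suffix ('s, 'm, 're or 've).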
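
To see which group holds what, the pattern can be run through Python's `re` module, which supports the `(?(1)...|...)` conditional. This is only a minimal sketch: the names `CONTRACTION` and `contraction_parts` are made up for the example and are not part of the answer.

```python
import re

# The pattern from the answer above, unchanged.
CONTRACTION = re.compile(
    r"""(?<!['"\w])(['"])?([a-zA-Z]+(?:('d|'ll|n't)('ve)?|('s|'m|'re|'ve)))(?(1)\1|(?!\1))(?!['"\w])"""
)

def contraction_parts(text):
    """Return groups 2 to 5 for every contraction found in text.

    Groups that did not take part in a match come back as None.
    """
    return [m.group(2, 3, 4, 5) for m in CONTRACTION.finditer(text)]

print(contraction_parts("I wouldn't've done that."))
print(contraction_parts("It's fine"))
```

Note that the pattern only matches words that actually contain a contraction, so plain words such as "done" are skipped rather than returned as tokens.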

  • 2021-01-20 22:30

    Here is a simple one that expands the contractions instead of splitting them:

    text = ' ' + text.lower() + ' '
    text = text.replace(" won't ", ' will not ').replace("n't ", ' not ') \
        .replace("'s ", ' is ').replace("'m ", ' am ') \
        .replace("'ll ", ' will ').replace("'d ", ' would ') \
        .replace("'re ", ' are ').replace("'ve ", ' have ')
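
Wrapped in a function, the chain above might look like the sketch below. The name `expand_contractions` is hypothetical; the replacements are the answer's own. Two things to keep in mind: every pattern ends in a space, so a contraction directly followed by punctuation is left untouched, and "'s" is always read as "is", so possessives get expanded as well.

```python
def expand_contractions(text):
    # Pad with spaces so contractions at the very start or end still match.
    padded = ' ' + text.lower() + ' '
    expanded = (
        padded.replace(" won't ", ' will not ').replace("n't ", ' not ')
        .replace("'s ", ' is ').replace("'m ", ' am ')
        .replace("'ll ", ' will ').replace("'d ", ' would ')
        .replace("'re ", ' are ').replace("'ve ", ' have ')
    )
    # Drop the padding added above.
    return expanded.strip()

print(expand_contractions("I won't go"))
print(expand_contractions("She's here, isn't she"))
```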
    
  • 2021-01-20 22:36
    >>> import nltk
    >>> nltk.word_tokenize("I wouldn't've done that.")
    ['I', "wouldn't", "'ve", 'done', 'that', '.']
    

    So tokenizing each token a second time and flattening the result gives:

    >>> from itertools import chain
    >>> [nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]
    [['I'], ['would', "n't"], ["'ve"], ['done'], ['that'], ['.']]
    >>> list(chain(*[nltk.word_tokenize(i) for i in nltk.word_tokenize("I wouldn't've done that.")]))
    ['I', 'would', "n't", "'ve", 'done', 'that', '.']
    
  • 2021-01-20 22:39

    You can use this regex to tokenize the text:

    (?:(?!.')\w)+|\w?'\w+|[^\s\w]
    

    Usage:

    >>> re.findall(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]", "I wouldn't've done that.")
    ['I', 'would', "n't", "'ve", 'done', 'that', '.']
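
One caveat worth checking before relying on this pattern: the `\w?'\w+` alternative always takes the letter in front of the apostrophe. That is what you want for "n't", but for other suffixes it splits "She's" into "Sh" and "e's". The sketch below compares the original with a possible variant that matches "n't" explicitly. `VARIANT` and `tokenize` are names invented for this example, not something from the answer.

```python
import re

# The regex from the answer, unchanged.
ORIGINAL = re.compile(r"(?:(?!.')\w)+|\w?'\w+|[^\s\w]")

# Hypothetical variant: "n't" and apostrophe suffixes are matched explicitly,
# and a word stops right before a trailing "n't".
VARIANT = re.compile(r"n't|'\w+|(?:(?!n't)\w)+|[^\s\w]")

def tokenize(text, pattern=VARIANT):
    """Return all tokens found by the given pattern."""
    return pattern.findall(text)

print(tokenize("She's here", ORIGINAL))
print(tokenize("She's here"))
print(tokenize("I wouldn't've done that."))
```

The variant still has limits: an opening single quote placed directly before a word is glued to that word by `'\w+`, so quoted text would need separate handling.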
    