Spacy custom tokenizer to include only hyphen words as tokens using Infix regex

清酒与你 2020-12-29 13:53

I want to include hyphenated words, for example long-term, self-esteem, etc., as a single token in spaCy. After looking at some similar posts on StackOverflow, Githu

1 answer
  • 2020-12-29 14:26

    Using the default prefix_re and suffix_re gives me the expected output:

    import re
    import spacy
    from spacy.tokenizer import Tokenizer
    from spacy.util import compile_prefix_regex, compile_suffix_regex

    def custom_tokenizer(nlp):
        # Infix pattern: punctuation that may split a token internally.
        # The hyphen is deliberately absent, so "male-dominated" stays whole.
        infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')
        # Reuse spaCy's default prefix and suffix rules.
        prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
        suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

        return Tokenizer(nlp.vocab, prefix_search=prefix_re.search,
                                    suffix_search=suffix_re.search,
                                    infix_finditer=infix_re.finditer,
                                    token_match=None)

    # The 'en' shortcut used originally was removed in spaCy v3;
    # a full package name such as 'en_core_web_sm' works in v2 and v3.
    nlp = spacy.load('en_core_web_sm')
    nlp.tokenizer = custom_tokenizer(nlp)

    doc = nlp('Note: Since the fourteenth century the practice of “medicine” has become a profession; and more importantly, it\'s a male-dominated profession.')
    print([token.text for token in doc])
    

    ['Note', ':', 'Since', 'the', 'fourteenth', 'century', 'the', 'practice', 'of', '“', 'medicine', '”', 'has', 'become', 'a', 'profession', ';', 'and', 'more', 'importantly', ',', 'it', "'s", 'a', 'male-dominated', 'profession', '.']
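
    To see why 'male-dominated' survives, it can help to run the infix pattern on its own, outside spaCy. The sketch below uses only the standard re module; split_on_infixes is a hypothetical helper that roughly imitates cutting a chunk at each infix match. It is not spaCy's actual tokenizer loop, which also applies prefixes, suffixes and special cases.

```python
import re

# Same character class as the answer's infix_re; it contains no hyphen.
infix_re = re.compile(r'''[.\,\?\:\;\...\‘\’\`\“\”\"\'~]''')

def split_on_infixes(chunk):
    """Hypothetical helper: cut `chunk` at every infix match,
    keeping each matched character as its own piece."""
    pieces, start = [], 0
    for match in infix_re.finditer(chunk):
        if match.start() > start:
            pieces.append(chunk[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(chunk):
        pieces.append(chunk[start:])
    return pieces

# No hyphen in the pattern, so finditer yields nothing and the word stays whole.
hyphenated = split_on_infixes("male-dominated")
# A comma is in the pattern, so the chunk is cut around it.
with_comma = split_on_infixes("yes,no")
print(hyphenated, with_comma)
```

    Adding a hyphen to the character class would make the first call split at the hyphen, which is the behaviour the question wants to avoid.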

    If you want to dig into why your regexes weren't working like spaCy's, here are links to the relevant source code:

    Prefixes and suffixes defined here:

    https://github.com/explosion/spaCy/blob/master/spacy/lang/punctuation.py

    With reference to characters (e.g., quotes, hyphens, etc.) defined here:

    https://github.com/explosion/spaCy/blob/master/spacy/lang/char_classes.py

    And the functions used to compile them (e.g., compile_prefix_regex):

    https://github.com/explosion/spaCy/blob/master/spacy/util.py
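
    As a rough picture of what those compile helpers do, the sketch below rebuilds the idea with plain re: each prefix piece is anchored with ^, each suffix piece with $, and the pieces are joined with |. The sketch_* functions and the toy entry lists are assumptions made for illustration, not spaCy's code or its real defaults; the linked util.py is the authority.

```python
import re

def sketch_prefix_regex(entries):
    # Anchor every piece at the start of the string, then OR them together.
    return re.compile("|".join("^" + piece for piece in entries if piece.strip()))

def sketch_suffix_regex(entries):
    # Anchor every piece at the end of the string, then OR them together.
    return re.compile("|".join(piece + "$" for piece in entries if piece.strip()))

# Toy entries (hypothetical); spaCy's defaults are much longer.
toy_prefixes = [r"\(", r"\[", "“"]
toy_suffixes = [r"\)", r"\]", "”", r"\.", ","]

prefix_re = sketch_prefix_regex(toy_prefixes)
suffix_re = sketch_suffix_regex(toy_suffixes)

# The opening quote is found as a prefix, the closing quote as a suffix.
prefix_hit = prefix_re.search("“medicine”")
suffix_hit = suffix_re.search("“medicine”")
# A hyphenated word has no leading prefix character, so nothing matches.
no_hit = prefix_re.search("male-dominated")
print(prefix_hit.group(), suffix_hit.group(), no_hit)
```

    Because the anchors are added by the helper, the individual entries are written without ^ or $. A hand-written prefix or suffix regex that leaves the anchors out can match in the middle of a token.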
