Why does spaCy not preserve intra-word hyphens during tokenization like Stanford CoreNLP does?

眼角桃花 2020-12-21 21:41

SpaCy Version: 2.0.11

Python Version: 3.6.5

OS: Ubuntu 16.04

My Sentence Samples:

Marketing-Representative- won't die in car accident.

1 Answer
  • 2020-12-21 22:18

    Although this is not documented on the spaCy usage site, it looks like we just need to add a regex for the *fix we are working with, in this case the infixes.

    It also appears that we can extend nlp.Defaults.prefixes with custom regexes and pass the result in as the infix patterns:

    infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
    

    This will give you the desired result. There is no need to set defaults for the prefixes and suffixes, since we are not working with those.
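    To see why those three patterns produce the splits shown below, you can check the combined infix regex with plain `re` before wiring it into the tokenizer. This is only a sketch of the custom patterns on their own (spaCy's compile_infix_regex joins its entries into one alternation, so the hand-built pattern here should behave the same way for these inputs); `infix_pat` is an illustrative name, not spaCy API:

```python
import re

# The three custom infix patterns from the answer, joined as alternatives.
infix_pat = re.compile(r"[./]|[-]~|(.'.)")

# None of the custom patterns match inside "Marketing-Representative-"
# ([-]~ requires a tilde after the hyphen), so the token stays intact.
print([m.group() for m in infix_pat.finditer("Marketing-Representative-")])  # []

# (.'.) matches "n't" inside "won't", which is what splits it into "wo" + "n't".
print([m.group() for m in infix_pat.finditer("won't")])  # ["n't"]

# [./] matches the trailing period, splitting it off "accident.".
print([m.group() for m in infix_pat.finditer("accident.")])  # ["."]
```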

    import spacy
    from spacy.tokenizer import Tokenizer
    from spacy.util import compile_infix_regex
    
    nlp = spacy.load('en')
    
    # Extend the default prefix patterns with custom infix regexes:
    # [./]   -> split on periods and slashes
    # [-]~   -> split on a hyphen followed by a tilde
    # (.'.)  -> split on an apostrophe surrounded by characters (e.g. "n't")
    infixes = nlp.Defaults.prefixes + (r"[./]", r"[-]~", r"(.'.)")
    
    infix_re = compile_infix_regex(infixes)
    
    def custom_tokenizer(nlp):
        # Only infix_finditer is overridden; prefix and suffix handling
        # are deliberately left unset.
        return Tokenizer(nlp.vocab, infix_finditer=infix_re.finditer)
    
    nlp.tokenizer = custom_tokenizer(nlp)
    
    s1 = "Marketing-Representative- won't die in car accident."
    s2 = "Out-of-box implementation"
    
    for s in (s1, s2):
        doc = nlp(s)
        print([token.text for token in doc])
    

    Result

    $ python3 /tmp/nlp.py  
    ['Marketing-Representative-', 'wo', "n't", 'die', 'in', 'car', 'accident', '.']  
    ['Out-of-box', 'implementation']  
    

    You may want to refine the added regexes to make them more robust for other kinds of tokens that are close to the applied patterns.
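    For instance, checked with plain `re`, the loose patterns also match inside tokens you probably do not want to split. The strings below are illustrative examples, not from the answer:

```python
import re

# (.'.) matches any character, an apostrophe, and any character,
# so it also fires inside names like "O'Brien".
m = re.search(r".'.", "O'Brien")
print(m.group())  # "O'B" -- the name would be split around the apostrophe

# [./] matches every period, so abbreviations like "U.S." would
# be broken apart as well.
print([m.group() for m in re.finditer(r"[./]", "U.S.")])  # ['.', '.']
```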
