spaCy: Matcher with entities spanning more than a single token

梦毁少年i 2021-01-27 05:27

I am trying to create a matcher that finds negated custom entities in the text. It is working fine for entities that span a single token, but I am having trouble trying to captu

1 answer
  • 2021-01-27 05:50

    One solution is to use doc.retokenize() to merge the individual tokens of each multi-token entity into a single token, so that a single-token pattern in the Matcher covers the whole entity:

    import spacy
    from spacy.matcher import Matcher

    # Written against the spaCy v3 API; spacy.load() takes no parse/tag/entity flags.
    nlp = spacy.load('en_core_web_sm')

    # Register the custom entities with the built-in entity ruler component.
    animal = ["cat", "dog", "artic fox"]
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns([{"label": "animal", "pattern": a} for a in animal])

    doc = nlp("There is no cat in the house and no artic fox in the basement")

    # Merge every entity span into one token so that "artic fox" is a single token.
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(doc[ent.start:ent.end])

    # After merging, each entity is exactly one token, so no 'OP' is needed.
    matcher = Matcher(nlp.vocab)
    pattern = [{'LOWER': 'no'}, {'ENT_TYPE': 'animal'}]
    matcher.add('negated animal', [pattern])
    matches = matcher(doc)

    for match_id, start, end in matches:
        span = doc[start:end]
        print(span)
    

    The output is now:

    no cat
    no artic fox
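
    If merging tokens is undesirable (it changes the tokenization that any later processing sees), a possible variant is to leave the Doc untouched and let the pattern consume one or more consecutive animal tokens. The following is only a minimal sketch, assuming spaCy v3; spacy.blank("en") is used here just so the sketch runs without a downloaded model, and the match key NEGATED_ANIMAL and the variable negated are illustrative names that do not come from the original answer:

```python
import spacy
from spacy.matcher import Matcher

# Assumption: spaCy v3. A blank English pipeline is enough here because the
# entity ruler needs only the tokenizer.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [{"label": "animal", "pattern": a} for a in ["cat", "dog", "artic fox"]]
)

doc = nlp("There is no cat in the house and no artic fox in the basement")

matcher = Matcher(nlp.vocab)
# 'OP' sits next to 'ENT_TYPE' in the token dict: one or more animal tokens.
pattern = [{"LOWER": "no"}, {"ENT_TYPE": "animal", "OP": "+"}]
# greedy="LONGEST" keeps "no artic fox" and drops the shorter "no artic".
matcher.add("NEGATED_ANIMAL", [pattern], greedy="LONGEST")

# Sort by start offset so the spans come out in document order.
matches = sorted(matcher(doc), key=lambda m: m[1])
negated = [doc[start:end].text for _, start, end in matches]
for text in negated:
    print(text)
```

    Because nothing is merged in this variant, "artic" and "fox" remain separate tokens in doc; only the matched spans cover the whole entity.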
