Using PhraseMatcher in spaCy to find multiple match types

走了就别回头了 2020-12-28 19:49

The SpaCy documentation and samples show that the PhraseMatcher class is useful to match sequences of tokens in documents. One must provide a vocabulary of sequences that wi

1 answer
  •  野趣味 (OP)
    2020-12-28 20:47

    spaCy's PhraseMatcher lets you add multiple rules, each containing several patterns, and assign an ID to each rule you add. If two rules overlap, both matches will be returned. So you could do something like this:

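    import spacy
    from spacy.matcher import PhraseMatcher

    nlp = spacy.blank('en')  # assumption: a blank English pipeline; any loaded pipeline works too

    color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
    product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
    material_patterns = [nlp(text) for text in ('silk', 'yellow fabric')]

    matcher = PhraseMatcher(nlp.vocab)
    # spaCy v3 signature; in spaCy v2 this was matcher.add('COLOR', None, *color_patterns)
    matcher.add('COLOR', color_patterns)
    matcher.add('PRODUCT', product_patterns)
    matcher.add('MATERIAL', material_patterns)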
    

    When you call the matcher on your doc, spaCy will return a list of (match_id, start, end) tuples. Because spaCy stores all strings as integers, the match_id you get back will be an integer, too – but you can always get the string representation by looking it up in the vocabulary's StringStore, i.e. nlp.vocab.strings:

    doc = nlp("yellow fabric")
    matches = matcher(doc)
    for match_id, start, end in matches:
        rule_id = nlp.vocab.strings[match_id]  # get the string ID, e.g. 'COLOR'
        span = doc[start : end]  # get the matched slice of the doc
        print(rule_id, span.text)
    
    # COLOR yellow
    # MATERIAL yellow fabric
    

    When you add matcher rules, you can also pass an on_match callback to PhraseMatcher.add: it is the second positional argument in spaCy v2 and the on_match keyword argument in spaCy v3. This is useful if you want to trigger specific actions, for example doing one thing when a COLOR match is found and something else for a PRODUCT match.
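
    As a rough sketch of the callback idea, assuming spaCy v3 and a blank English pipeline: each rule gets its own on_match function, and the two handlers simply collect the matched text. The `found` dictionary and the handler names `on_color` and `on_product` are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Assumption: a tokenizer-only pipeline is enough for exact phrase matching.
nlp = spacy.blank("en")

# Hypothetical container for whatever the callbacks decide to record.
found = {"COLOR": [], "PRODUCT": []}


def on_color(matcher, doc, i, matches):
    # matches[i] is the (match_id, start, end) tuple that triggered this call
    match_id, start, end = matches[i]
    found["COLOR"].append(doc[start:end].text)


def on_product(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    found["PRODUCT"].append(doc[start:end].text)


matcher = PhraseMatcher(nlp.vocab)
matcher.add(
    "COLOR",
    [nlp.make_doc(text) for text in ("red", "green", "yellow")],
    on_match=on_color,
)
matcher.add(
    "PRODUCT",
    [nlp.make_doc(text) for text in ("boots", "coats", "bag")],
    on_match=on_product,
)

doc = nlp("I bought red boots and a yellow bag")
matches = matcher(doc)  # the callbacks fire while the matcher runs
print(found)
```

    The matcher still returns the usual list of (match_id, start, end) tuples, so the callbacks can be mixed with the loop shown earlier.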

    If you want to solve this even more elegantly, you might also want to look into combining your matcher with a custom pipeline component or custom attributes. For example, you could write a simple component that's run automatically when you call nlp() on your text, finds the matches, and sets a Doc._.contains_product or Token._.is_color attribute. The docs have a few examples of this that should help you get started.
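
    One possible shape for such a component, assuming spaCy v3: the factory name "phrase_flags" and the function names below are invented for this sketch, and the pattern lists are the same illustrative ones used above.

```python
import spacy
from spacy.language import Language
from spacy.matcher import PhraseMatcher
from spacy.tokens import Doc, Token

# Register the custom attributes; force=True avoids an error if this runs twice.
Token.set_extension("is_color", default=False, force=True)
Doc.set_extension("contains_product", default=False, force=True)


@Language.factory("phrase_flags")  # hypothetical component name
def create_phrase_flags(nlp, name):
    matcher = PhraseMatcher(nlp.vocab)
    matcher.add("COLOR", [nlp.make_doc(t) for t in ("red", "green", "yellow")])
    matcher.add("PRODUCT", [nlp.make_doc(t) for t in ("boots", "coats", "bag")])

    def phrase_flags(doc):
        for match_id, start, end in matcher(doc):
            rule_id = doc.vocab.strings[match_id]
            if rule_id == "COLOR":
                # Flag every token inside the matched color phrase
                for token in doc[start:end]:
                    token._.is_color = True
            elif rule_id == "PRODUCT":
                doc._.contains_product = True
        return doc

    return phrase_flags


# Assumption: a blank pipeline; the component can be added to a loaded one too.
nlp = spacy.blank("en")
nlp.add_pipe("phrase_flags")

doc = nlp("yellow boots")
print(doc._.contains_product, [(t.text, t._.is_color) for t in doc])
```

    With this in place, downstream code only needs to check the attributes and never has to call the matcher itself.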
