How to get phrase count in spaCy PhraseMatcher

Question

I am trying out spaCy's PhraseMatcher, using an adaptation of the example from the spaCy website, shown below.

import spacy
from spacy.matcher import PhraseMatcher

# Assumption: the post does not show how nlp was created; a blank English
# pipeline should be enough for PhraseMatcher's default exact-text matching.
nlp = spacy.blank('en')

color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('bat', 'yellow ball')]

matcher = PhraseMatcher(nlp.vocab)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)

doc = nlp("yellow ball yellow lines")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # look up the rule name, e.g. 'COLOR'
    span = doc[start:end]  # get the matched slice of the doc
    print(rule_id, span.text)

The output is

COLOR yellow
MATERIAL ball

How do I get the count of each phrase, so that the output shows that yellow occurred twice and ball only once, like this?

COLOR Yellow (2)
MATERIAL ball (1)

Answer 1:

Something like this? Collect each (rule, text) pair in a list and count the identical pairs with collections.Counter:

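from collections import Counter

import spacy
from spacy.matcher import PhraseMatcher

# Assumption: the post does not show how nlp was created; a blank English
# pipeline should be enough for PhraseMatcher's default exact-text matching.
nlp = spacy.blank('en')

color_patterns = [nlp(text) for text in ('red', 'green', 'yellow')]
product_patterns = [nlp(text) for text in ('boots', 'coats', 'bag')]
material_patterns = [nlp(text) for text in ('bat', 'yellow ball')]

matcher = PhraseMatcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 pass the patterns as a list instead,
# e.g. matcher.add('COLOR', color_patterns)
matcher.add('COLOR', None, *color_patterns)
matcher.add('PRODUCT', None, *product_patterns)
matcher.add('MATERIAL', None, *material_patterns)

d = []  # one (rule name, matched text) pair per match
doc = nlp("yellow ball yellow lines")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # look up the rule name, e.g. 'COLOR'
    span = doc[start:end]  # get the matched slice of the doc
    d.append((rule_id, span.text))

# Count identical (rule, text) pairs and print one line per distinct pair
print("\n".join(f'{rule} {text} ({count})' for (rule, text), count in Counter(d).items()))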

Output:

COLOR yellow (2)
MATERIAL yellow ball (1)
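
If the counts should ignore case (the desired output in the question shows "Yellow") or be listed most frequent first, the tallying step can be pulled out of the matching loop. The following is only a sketch: count_phrases is a hypothetical helper, not part of spaCy, and it assumes the matches have already been turned into (rule_id, text) pairs as in the loop above. Lowercasing the text is an assumption about what is wanted, not something the answer above does.

```python
from collections import Counter


def count_phrases(pairs, ignore_case=True):
    """Tally (rule_id, text) pairs and return formatted lines, most frequent first.

    Hypothetical helper for illustration; not part of spaCy.
    """
    counts = Counter(
        (rule_id, text.lower() if ignore_case else text)
        for rule_id, text in pairs
    )
    # most_common() sorts by count, descending; ties keep first-seen order
    return [f"{rule_id} {text} ({n})" for (rule_id, text), n in counts.most_common()]


# The pairs are hard-coded so the sketch runs without spaCy; they stand in for
# what the matching loop collects for "yellow ball yellow lines" (the order in
# which the matcher reports them may differ).
pairs = [("COLOR", "yellow"), ("MATERIAL", "yellow ball"), ("COLOR", "yellow")]
lines = count_phrases(pairs)
print("\n".join(lines))
```

With spaCy in place, the list d built in the answer's loop could be passed in directly, e.g. count_phrases(d). Keeping the counting separate from the matcher means the same helper works regardless of which spaCy version produced the matches.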

Source: https://stackoverflow.com/questions/53461757/how-to-get-phrase-count-in-spacy-phrasematcher

Tags

python-3.x

nlp

spacy