How to get the prediction probability per entity from a spaCy NER model?

不知归路 2021-01-01 09:47

I used this official example code to train an NER model from scratch using my own training samples.

When I predict using this model on new text, I want to get the pr

2 Answers
  • 2021-01-01 09:52

    Getting per-entity prediction probabilities from a spaCy NER model is not trivial. Here is a solution, adapted from here:

    
    import spacy
    from collections import defaultdict

    # Note: nlp.entity is the spaCy v2 accessor for the NER component.
    texts = ['John works at Microsoft.']

    # Number of alternative analyses to consider. More is slower and not
    # necessarily better; you need to experiment on your problem.
    beam_width = 16
    # Clips solutions at each step: the score of the top-ranked action is
    # multiplied by this value and the result is used as a threshold. This
    # prevents the parser from exploring options that look very unlikely,
    # which saves some time. Accuracy may also improve, because the model
    # was trained with a greedy objective.
    beam_density = 0.0001
    nlp = spacy.load('en_core_web_md')

    # Run the pipeline without NER, then beam-parse the docs with the NER component.
    docs = list(nlp.pipe(texts, disable=['ner']))
    beams = nlp.entity.beam_parse(docs, beam_width=beam_width, beam_density=beam_density)

    for doc, beam in zip(docs, beams):
        # Sum the scores of all beam parses that contain a given entity.
        entity_scores = defaultdict(float)
        for score, ents in nlp.entity.moves.get_beam_parses(beam):
            for start, end, label in ents:
                entity_scores[(start, end, label)] += score

        # Print the candidates for this doc, ordered by start token.
        rows = [{'start': start, 'end': end, 'label': label, 'prob': prob}
                for (start, end, label), prob in entity_scores.items()]
        for row in sorted(rows, key=lambda r: r['start']):
            print(row)
    
    ### Output: ####
    
    {'start': 0, 'end': 1, 'label': 'PERSON', 'prob': 0.4054479906820232}
    {'start': 0, 'end': 1, 'label': 'ORG', 'prob': 0.01002015005487447}
    {'start': 0, 'end': 1, 'label': 'PRODUCT', 'prob': 0.0008592912552754791}
    {'start': 0, 'end': 1, 'label': 'WORK_OF_ART', 'prob': 0.0007666755792166002}
    {'start': 0, 'end': 1, 'label': 'NORP', 'prob': 0.00034931990870877333}
    {'start': 0, 'end': 1, 'label': 'TIME', 'prob': 0.0002786051849320804}
    {'start': 3, 'end': 4, 'label': 'ORG', 'prob': 0.9990115861687987}
    {'start': 3, 'end': 4, 'label': 'PRODUCT', 'prob': 0.0003378157477046507}
    {'start': 3, 'end': 4, 'label': 'FAC', 'prob': 8.249734411749544e-05}
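
To make the summing step in the loop above easier to follow, here is a minimal sketch that runs without spaCy. The parses and scores in `toy_parses` are made up for illustration (they are not spaCy output), and the helper name `entity_marginals` is hypothetical. It assumes, as the code above does, that each beam parse comes with a score and a list of `(start, end, label)` entities.

```python
from collections import defaultdict

# Hypothetical beam parses: (parse score, [(start, end, label), ...]).
# Assumption: the scores of the parses in one beam sum to 1.
toy_parses = [
    (0.5, [(0, 1, "PERSON"), (3, 4, "ORG")]),
    (0.25, [(3, 4, "ORG")]),
    (0.25, [(0, 1, "ORG"), (3, 4, "PRODUCT")]),
]


def entity_marginals(parses):
    """Sum the score of every parse that contains a given (start, end, label)."""
    scores = defaultdict(float)
    for score, ents in parses:
        for start, end, label in ents:
            scores[(start, end, label)] += score
    return dict(scores)


marginals = entity_marginals(toy_parses)
for key, prob in sorted(marginals.items()):
    print(key, prob)
```

In this toy beam, `(3, 4, "ORG")` appears in two parses, so its score is the sum of both parse scores. An entity that appears in only one low-scoring parse gets a correspondingly small value.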
    
    
  • 2021-01-01 10:01

    Sorry, I do not have a better answer. I can only confirm that the 'beam' solution does provide some 'probabilities', though in my case I get far too many entities with prob=1.0, even in cases where I can only shake my head and blame it on too little training data.

    I find it quite strange that spaCy reports an 'entity' without any confidence attached to it. I would assume there is some threshold that decides when spaCy reports an entity and when it does not (perhaps I missed it). In my case, I see an entity with confidence 0.6 reported as 'this is an entity', while an entity with confidence 0.001 is not reported.

    In my use case, the confidence is essential. For a given text, spaCy (and, for example, Google ML) reports multiple instances of 'MY_ENTITY'. My code has to decide which ones are to be 'trusted' and which ones are false positives. I have yet to see whether the 'probability' returned by the above code has any practical value.
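
One possible way to make the 'trusted or not' decision, sketched under assumptions: the input has the `{(start, end, label): prob}` shape built in the first answer, the helper name `trusted_entities` is hypothetical, and the 0.5 cutoff is an arbitrary placeholder that would need tuning on held-out data. The sketch keeps only the best-scoring label per span and drops it if it falls below the cutoff.

```python
def trusted_entities(entity_scores, threshold=0.5):
    """Keep the best-scoring label per (start, end) span if it clears the threshold."""
    best = {}
    for (start, end, label), prob in entity_scores.items():
        span = (start, end)
        if span not in best or prob > best[span][1]:
            best[span] = (label, prob)
    return [
        {"start": start, "end": end, "label": label, "prob": prob}
        for (start, end), (label, prob) in sorted(best.items())
        if prob >= threshold
    ]


# Scores rounded from the example output in the first answer.
example_scores = {
    (0, 1, "PERSON"): 0.405,
    (0, 1, "ORG"): 0.010,
    (3, 4, "ORG"): 0.999,
    (3, 4, "PRODUCT"): 0.0003,
}
kept = trusted_entities(example_scores, threshold=0.5)
print(kept)
```

With these numbers only the ORG span at tokens 3 to 4 survives. Lowering the cutoff to 0.4 would also keep the PERSON span at tokens 0 to 1. Whether such a cutoff separates true entities from false positives in practice is exactly the open question raised above.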
