How to “update” an existing Named Entity Recognition model - rather than creating from scratch?

无人共我 2020-12-06 07:54

Please see the tutorial steps for OpenNLP Named Entity Recognition (link to tutorial). I am using the "en-ner-person.bin" model found here. In the tutorial, there are inst

2 Answers
  • 2020-12-06 08:07

    Sorry it took me a while to put together a decent code example. The code below reads in your sentences and runs the default en-ner-person model over them to do its best. It then writes the good hits to one file and the bad hits to another, and feeds those files into the "modelbuilder-addon" call at the bottom.

    To get the best results, run the class as is. Then go into the known entities file and the blacklist file and edit them by hand: add names you know about that the model did not find to the knowns file, and remove bad names from it. Move good names out of the blacklist file and into the knowns file. Then run only the model builder part again, without the first part that reads in all your data. It's OK to have duplicates in the knowns and blacklist files. If you have questions, let me know; it's a bit complicated.

    import java.io.File;
    import java.io.FileWriter;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import opennlp.addons.modelbuilder.DefaultModelBuilderUtil;
    import opennlp.tools.entitylinker.EntityLinkerProperties;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;
    
    public class ModelBuilderAddonUse {
    // Fill this method in with however you get your data into a list of sentences; I am hitting a MySQL database.
    // DocProvider is not among the imports above, so it has to be supplied alongside this class; substitute your own data source.
      private static List<String> getSentencesFromSomewhere() throws Exception {
        List<String> sentences = new ArrayList<>();
        int counter = 0;
        DocProvider dp = new DocProvider();
        String modelPath = "c:\\apache\\entitylinker\\";
        EntityLinkerProperties properties = new EntityLinkerProperties(new File(modelPath + "entitylinker.properties"));
        Map<Long, List<String>> docs = dp.getDocs(properties);
        for (Long key : docs.keySet()) {
          System.out.println("\t\tDOC: " + key + "\n\n");
          sentences.addAll(docs.get(key));
          // count each document once, and stop once the counter passes 1000
          counter++;
          if (counter > 1000) {
            break;
          }
        }
        return sentences;
      }
    
      public static void main(String[] args) throws Exception {
        /**
         * establish a file to put sentences in
         */
        File sentences = new File("C:\\temp\\modelbuilder\\sentences.text");
    
        /**
         * establish a file to put your NER hits in (the ones you want to keep based
         * on prob)
         */
        File knownEntities = new File("C:\\temp\\modelbuilder\\knownentities.txt");
    
        /**
         * establish a BLACKLIST file to put your bad NER hits in (also can be based
         * on prob)
         */
        File blacklistedentities = new File("C:\\temp\\modelbuilder\\blentities.txt");
    
        /**
         * establish a file to write your annotated sentences to
         */
        File annotatedSentences = new File("C:\\temp\\modelbuilder\\annotatedSentences.txt");
    
        /**
         * establish a file to write your model to
         */
        File theModel = new File("C:\\temp\\modelbuilder\\theModel");
    
    
    //------------create a bunch of file writers to write your results and sentences to a file
    
        FileWriter sentenceWriter = new FileWriter(sentences, true);
        FileWriter blacklistWriter = new FileWriter(blacklistedentities, true);
        FileWriter knownEntityWriter = new FileWriter(knownEntities, true);
    
    //set some thresholds to decide where to write hits, you don't have to use these at all...
        double keeperThresh = .95;
        double blacklistThresh = .7;
    
    
        /**
         * Load your model as normal
         */
        TokenNameFinderModel personModel = new TokenNameFinderModel(new File("c:\\temp\\opennlpmodels\\en-ner-person.zip"));
        NameFinderME personFinder = new NameFinderME(personModel);
        /**
         * do your normal NER on the sentences you have
         */
        for (String s : getSentencesFromSomewhere()) {
          sentenceWriter.write(s.trim() + "\n");
          sentenceWriter.flush();
    
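          String[] tokens = s.split(" "); // better to use a real tokenizer here
          Span[] find = personFinder.find(tokens);
          // probs(find) returns one probability per span, so it lines up with names[i];
          // the no-argument probs() returns one probability per token instead
          double[] probs = personFinder.probs(find);
          String[] names = Span.spansToStrings(find, tokens);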
          for (int i = 0; i < names.length; i++) {
            //YOU PROBABLY HAVE BETTER HEURISTICS THAN THIS TO MAKE SURE YOU GET GOOD HITS OUT OF THE DEFAULT MODEL
            if (probs[i] > keeperThresh) {
              knownEntityWriter.write(names[i].trim() + "\n");
            }
            if (probs[i] < blacklistThresh) {
              blacklistWriter.write(names[i].trim() + "\n");
            }
          }
          personFinder.clearAdaptiveData();
          blacklistWriter.flush();
          knownEntityWriter.flush();
        }
        //flush and close all the writers
        knownEntityWriter.flush();
        knownEntityWriter.close();
        sentenceWriter.flush();
        sentenceWriter.close();
        blacklistWriter.flush();
        blacklistWriter.close();
    
        /**
         * THIS IS WHERE THE ADDON IS GOING TO USE THE FILES (AS IS) TO CREATE A NEW MODEL. YOU SHOULD NOT HAVE TO RUN THE FIRST PART AGAIN AFTER THIS RUNS, JUST NOW PLAY WITH THE
         * KNOWN ENTITIES AND BLACKLIST FILES AND RUN THE METHOD BELOW AGAIN UNTIL YOU GET SOME DECENT RESULTS (A DECENT MODEL OUT OF IT).
         */
        DefaultModelBuilderUtil.generateModel(sentences, knownEntities, blacklistedentities,
                theModel, annotatedSentences, "person", 3);
    
    
      }
    }
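
    The threshold logic in the listing above writes a hit to the knowns file when its probability is above keeperThresh and to the blacklist file when it is below blacklistThresh, so anything in between lands in neither file. The sketch below isolates that routing in plain Java so it can be tried without OpenNLP. The class name HitRouter and the REVIEW bucket are hypothetical additions for illustration, not part of the add-on or of the original code.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class HitRouter {
    /** REVIEW is a hypothetical third bucket for hits between the two thresholds. */
    enum Bucket { KNOWN, BLACKLIST, REVIEW }

    /**
     * Routes each name by its probability, mirroring the two if-statements
     * in the listing: above keeperThresh goes to KNOWN, below blacklistThresh
     * goes to BLACKLIST, everything else is kept aside for manual review.
     */
    static Map<Bucket, List<String>> route(String[] names, double[] probs,
                                           double keeperThresh, double blacklistThresh) {
        if (names.length != probs.length) {
            throw new IllegalArgumentException("names and probs must have the same length");
        }
        Map<Bucket, List<String>> out = new EnumMap<>(Bucket.class);
        for (Bucket b : Bucket.values()) {
            out.put(b, new ArrayList<>());
        }
        for (int i = 0; i < names.length; i++) {
            String name = names[i].trim();
            if (probs[i] > keeperThresh) {
                out.get(Bucket.KNOWN).add(name);
            } else if (probs[i] < blacklistThresh) {
                out.get(Bucket.BLACKLIST).add(name);
            } else {
                out.get(Bucket.REVIEW).add(name);
            }
        }
        return out;
    }
}
```

    Writing the REVIEW names to a third file is one possible way to get a short list to go through by hand when editing the knowns and blacklist files.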
    

    This is what the console should look like (I removed some lines for brevity here):

    ITERATION: 0
        Perfoming Known Entity Annotation
            knowns: 625
            reading data....: 
            writing annotated sentences....: 
            building model.... 
        Building Model using 7343 annotations
            reading training data...
    Indexing events using cutoff of 5
    
        Computing event counts...  done. 561755 events
        Indexing...  done.
    Sorting and merging events... done. Reduced 561755 events to 127362.
    Done indexing.
    Incorporating indexed data for training...  
    done.
        Number of Event Tokens: 127362
            Number of Outcomes: 3
          Number of Predicates: 106490
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-617150.9462211537  0.015709695507828147
      2:  ... loglikelihood=-90520.86903515142  0.9771288195031642
      3:  ... loglikelihood=-56901.86905339755  0.9771288195031642
      4:  ... loglikelihood=-44231.80460317638  0.9773086131854634
      5:  ... loglikelihood=-37222.56576767385  0.9787985865724381
      6:  ... loglikelihood=-32900.5623814595   0.9801924326441243
      7:  ... loglikelihood=-29992.881445391187 0.9829747843810914
      8:  ... loglikelihood=-27893.341149419102 0.9836423351817073
      9:  ... loglikelihood=-26296.107313900917 0.9845092611547739
     10:  ... loglikelihood=-25033.501573153182 0.9850682236918229
     11:  ... loglikelihood=-24006.060636903556 0.9856182855515305
     12:  ... loglikelihood=-23150.856525607975 0.9859084476328649
     13:  ... loglikelihood=-22425.987337392176 0.9861897090368577
     14:  ... loglikelihood=-21802.386362016423 0.9864211266477378
     15:  ... loglikelihood=-21259.20580401235  0.9865208142339632
     16:  ... loglikelihood=-20781.0716762281   0.9867362106256287
     17:  ... loglikelihood=-20356.37732369309  0.986905323495118
     18:  ... loglikelihood=-19976.18228587008  0.9870673158227341
     19:  ... loglikelihood=-19633.47877575036  0.9872097266601988
     20:  ... loglikelihood=-19322.689448146353 0.9873165347882974
     21:  ... loglikelihood=-19039.31522510173  0.9874073216971812
     22:  ... loglikelihood=-18779.683112448918 0.9875176900962164
     23:  ... loglikelihood=-18540.76222439295  0.9876316187661881
     24:  ... loglikelihood=-18320.027315327916 0.9877081645913254
     25:  ... loglikelihood=-18115.35602743375  0.9877918309583359
     26:  ... loglikelihood=-17924.95047403401  0.9878612562416
     27:  ... loglikelihood=-17747.27665623459  0.9879378020667373
     28:  ... loglikelihood=-17581.01712643139  0.9879947664017231
     29:  ... loglikelihood=-17425.03361369085  0.9880784327687337
     30:  ... loglikelihood=-17278.3372262906   0.9881282765618463
     31:  ... loglikelihood=-17140.06447937828  0.9882012621160471
     32:  ... loglikelihood=-17009.45784626013  0.9882546661800963
     33:  ... loglikelihood=-16885.84985637711  0.9883187510569554
     34:  ... loglikelihood=-16768.64999916476  0.9883703749855364
     35:  ... loglikelihood=-16657.3338665414   0.9884166585077124
     36:  ... loglikelihood=-16551.434095577726 0.9884558214880153
     37:  ... loglikelihood=-16450.532769374073 0.9885074454165962
     38:  ... loglikelihood=-16354.255007222264 0.9885448282614306
     39:  ... loglikelihood=-16262.263530858221 0.9885733104289236
     40:  ... loglikelihood=-16174.254036589966 0.9886391754412511
     41:  ... loglikelihood=-16089.951236435176 0.9886765582860856
     42:  ... loglikelihood=-16009.105457548561 0.9887281822146665
     43:  ... loglikelihood=-15931.489709807445 0.988747763704818
     44:  ... loglikelihood=-15856.897147780543 0.9887798061432475
     45:  ... loglikelihood=-15785.138866385483 0.9888065081752722
     46:  ... loglikelihood=-15716.041980029182 0.9888349903427651
     47:  ... loglikelihood=-15649.447943527766 0.9888581321038531
     48:  ... loglikelihood=-15585.211079986258 0.9888901745422827
     49:  ... loglikelihood=-15523.19728647256  0.9889328977935221
     50:  ... loglikelihood=-15463.282892914636 0.9889595998255467
     51:  ... loglikelihood=-15405.353653492159 0.9889685005028883
     52:  ... loglikelihood=-15349.303852923775 0.9889809614511664
     53:  ... loglikelihood=-15295.035512678789 0.9889934223994445
     54:  ... loglikelihood=-15242.457684348112 0.989013003889596
     55:  ... loglikelihood=-15191.485819217298 0.9890236847024059
     56:  ... loglikelihood=-15142.041204645499 0.9890397059216206
     57:  ... loglikelihood=-15094.050459152337 0.9890539470053671
     58:  ... loglikelihood=-15047.445079207273 0.9890592874117721
     59:  ... loglikelihood=-15002.161031666768 0.9890753086309868
     60:  ... loglikelihood=-14958.13838658306  0.9890966702566065
     61:  ... loglikelihood=-14915.320985817205 0.9891180318822262
     62:  ... loglikelihood=-14873.656143433394 0.9891269325595677
     63:  ... loglikelihood=-14833.094374397517 0.9891500743206558
     64:  ... loglikelihood=-14793.589148498404 0.9891589749979973
     65:  ... loglikelihood=-14755.096666806796 0.9891785564881488
     66:  ... loglikelihood=-14717.5756582924   0.9891892373009586
     67:  ... loglikelihood=-14680.98719451864  0.9891892373009586
     68:  ... loglikelihood=-14645.294520562966 0.9891945777073635
     69:  ... loglikelihood=-14610.462900520715 0.9891999181137685
     70:  ... loglikelihood=-14576.45947616036  0.989214159197515
     71:  ... loglikelihood=-14543.25313742511  0.9892212797393881
     72:  ... loglikelihood=-14510.814403643026 0.9892230598748565
     73:  ... loglikelihood=-14479.115314429962 0.9892230598748565
     74:  ... loglikelihood=-14448.129329357815 0.9892426413650078
     75:  ... loglikelihood=-14417.831235594616 0.9892515420423494
     76:  ... loglikelihood=-14388.19706276905  0.9892622228551593
     77:  ... loglikelihood=-14359.204004414    0.9892711235325008
     78:  ... loglikelihood=-14330.8303454032   0.9892764639389058
     79:  ... loglikelihood=-14303.055394843146 0.9892764639389058
     80:  ... loglikelihood=-14275.859423957678 0.9892924851581205
     81:  ... loglikelihood=-14249.223608524193 0.9893013858354621
     82:  ... loglikelihood=-14223.129975482772 0.9893209673256135
     83:  ... loglikelihood=-14197.561353359844 0.9893263077320185
     84:  ... loglikelihood=-14172.50132620183  0.9893280878674867
     85:  ... loglikelihood=-14147.934190713178 0.9893263077320185
     86:  ... loglikelihood=-14123.84491635766  0.9893316481384233
     87:  ... loglikelihood=-14100.21910816809  0.9894313357246487
     88:  ... loglikelihood=-14077.042972066316 0.989433115860117
     89:  ... loglikelihood=-14054.303282478262 0.9894437966729268
     90:  ... loglikelihood=-14031.987352086799 0.9894580377566733
     91:  ... loglikelihood=-14010.083003539214 0.9894615980276099
     92:  ... loglikelihood=-13988.578542971209 0.9894776192468246
     93:  ... loglikelihood=-13967.46273521311  0.9894811795177613
     94:  ... loglikelihood=-13946.724780546094 0.9894829596532296
     95:  ... loglikelihood=-13926.354292898612 0.9894829596532296
     96:  ... loglikelihood=-13906.341279379953 0.9894900801951029
     97:  ... loglikelihood=-13886.676121050288 0.9894936404660395
     98:  ... loglikelihood=-13867.34955484593  0.9894954206015077
     99:  ... loglikelihood=-13848.35265657199  0.9894954206015077
    100:  ... loglikelihood=-13829.676824889664 0.9894972007369761
        model generated
            model building complete.... 
            annotated sentences: 7343
        Performing NER with new model
            Printing NER Results. Add undesired results to the blacklist file and start over
    
    //prints some names
    
        annotated sentences: 7369
            knowns: 651
    ITERATION: 1
        Perfoming Known Entity Annotation
            knowns: 651
            reading data....: 
            writing annotated sentences....: 
            building model.... 
        Building Model using 20370 annotations
            reading training data...
    Indexing events using cutoff of 5
    
        Computing event counts...  done. 1116781 events
        Indexing...  done.
    Sorting and merging events... done. Reduced 1116781 events to 288251.
    Done indexing.
    Incorporating indexed data for training...  
    done.
        Number of Event Tokens: 288251
            Number of Outcomes: 3
          Number of Predicates: 206399
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-1226909.3303549637 0.03418485808766446
      2:  ... loglikelihood=-196688.7107544095  0.9622047653031346
      3:  ... loglikelihood=-138615.22912914792 0.9651462551744702
      4:  ... loglikelihood=-114777.09879832959 0.9697075791941303
      5:  ... loglikelihood=-101055.0229949508  0.9716443958126079
      6:  ... loglikelihood=-92253.8923255943   0.973049326591337
      7:  ... loglikelihood=-86146.35307405592  0.9750121107003074
      8:  ... loglikelihood=-81641.85792288609  0.975682788299586
      9:  ... loglikelihood=-78164.62963136223  0.9762594456746667
     10:  ... loglikelihood=-75386.40867917785  0.9767044747358703
     11:  ... loglikelihood=-73106.85371375803  0.9770590652957025
     12:  ... loglikelihood=-71196.60721959372  0.9774718588514668
     13:  ... loglikelihood=-69568.23683712543  0.9777279520335679
     14:  ... loglikelihood=-68160.39924327709  0.9779374828189233
     15:  ... loglikelihood=-66928.70260893498  0.9780914969004666
     16:  ... loglikelihood=-65840.17418566217  0.9782661058882628
     17:  ... loglikelihood=-64869.77222395241  0.9784040022170865
     18:  ... loglikelihood=-63998.109674075415 0.9785159310554173
     19:  ... loglikelihood=-63209.92394252923  0.9786475593692944
     20:  ... loglikelihood=-62493.02131098982  0.9787505339005589
     21:  ... loglikelihood=-61837.53211219312  0.9788597764467698
     22:  ... loglikelihood=-61235.37451190329  0.9789457377946079
     23:  ... loglikelihood=-60679.86146007204  0.9790003590677133
     24:  ... loglikelihood=-60165.407875448924 0.979062143786472
     25:  ... loglikelihood=-59687.30928567587  0.9791346736737104
     26:  ... loglikelihood=-59241.572255584455 0.979201830976709
     27:  ... loglikelihood=-58824.78291785096  0.9792698837104141
     28:  ... loglikelihood=-58434.00392167818  0.979333459290586
     29:  ... loglikelihood=-58066.69284046825  0.979381812548745
     30:  ... loglikelihood=-57720.63696783972  0.9794355383911438
     31:  ... loglikelihood=-57393.9007602091   0.9795089637090889
     32:  ... loglikelihood=-57084.78313293037  0.9795483626601814
     33:  ... loglikelihood=-56791.78250307578  0.9795743301506741
     34:  ... loglikelihood=-56513.567973701254 0.9796298468544863
     35:  ... loglikelihood=-56248.955425711436 0.9796808864047651
     36:  ... loglikelihood=-55996.887560355084 0.9797202853558576
     37:  ... loglikelihood=-55756.41714443519  0.9797543117227102
     38:  ... loglikelihood=-55526.69286884015  0.9797963969659226
     39:  ... loglikelihood=-55306.94735282102  0.9798152010107621
     40:  ... loglikelihood=-55096.48692031122  0.9798563908232679
     41:  ... loglikelihood=-54894.68284780714  0.9799029532200136
     42:  ... loglikelihood=-54700.963840494    0.9799378750175728
     43:  ... loglikelihood=-54514.80953871555  0.9799656333694788
     44:  ... loglikelihood=-54335.744892614406 0.9800005551670381
     45:  ... loglikelihood=-54163.33527156895  0.9800301043803574
     46:  ... loglikelihood=-53997.182198154995 0.9800551764401436
     47:  ... loglikelihood=-53836.91961491415  0.980082039361343
     48:  ... loglikelihood=-53682.210607423985 0.980112484005369
     49:  ... loglikelihood=-53532.74451955152  0.980140242357275
     50:  ... loglikelihood=-53388.23440690913  0.9801688961398878
     51:  ... loglikelihood=-53248.41478285541  0.9801921773382606
     52:  ... loglikelihood=-53113.03961847529  0.9802109813831001
     53:  ... loglikelihood=-52981.880563479055 0.9802351580121796
     54:  ... loglikelihood=-52854.7253600851   0.9802584392105524
     55:  ... loglikelihood=-52731.37642565477  0.9802727661018589
     56:  ... loglikelihood=-52611.64958353087  0.9803005244537649
     57:  ... loglikelihood=-52495.37292415569  0.9803148513450712
     58:  ... loglikelihood=-52382.38578113555  0.9803470868505105
     59:  ... loglikelihood=-52272.53780883427  0.9803748452024166
     60:  ... loglikelihood=-52165.68814994865  0.9803891720937229
     61:  ... loglikelihood=-52061.7046829472   0.9804043944157359
     62:  ... loglikelihood=-51960.46334051503  0.9804151395842157
     63:  ... loglikelihood=-51861.84749132724  0.9804393162132952
     64:  ... loglikelihood=-51765.74737831825  0.9804491659510683
     65:  ... loglikelihood=-51672.05960757943  0.9804634928423747
     66:  ... loglikelihood=-51580.686682513515 0.9804876694714542
     67:  ... loglikelihood=-51491.53657871175  0.9805046826548804
     68:  ... loglikelihood=-51404.52235540815  0.9805172186847735
     69:  ... loglikelihood=-51319.56179989248  0.9805315455760798
     70:  ... loglikelihood=-51236.577101627925 0.9805440816059728
     71:  ... loglikelihood=-51155.494553260556 0.9805584084972793
     72:  ... loglikelihood=-51076.24427590388  0.980569153665759
     73:  ... loglikelihood=-50998.75996642977  0.9805825851263587
     74:  ... loglikelihood=-50922.97866477339  0.9805951211562518
     75:  ... loglikelihood=-50848.84053937224  0.9806112389089714
     76:  ... loglikelihood=-50776.28868909037  0.9806264612309844
     77:  ... loglikelihood=-50705.2689602481   0.9806389972608774
     78:  ... loglikelihood=-50635.729777298875 0.9806470561372372
     79:  ... loglikelihood=-50567.62198610024  0.9806658601820769
     80:  ... loglikelihood=-50500.8987085974   0.9806685464741968
     81:  ... loglikelihood=-50435.51520800019  0.9806775007812633
     82:  ... loglikelihood=-50371.42876358994  0.9806837687962098
     83:  ... loglikelihood=-50308.59855431275  0.9806918276725697
     84:  ... loglikelihood=-50246.98555046764  0.9806989911182228
     85:  ... loglikelihood=-50186.55241287111  0.980703468271756
     86:  ... loglikelihood=-50127.26339882067  0.9807195860244757
     87:  ... loglikelihood=-50069.08427441567  0.9807312266236621
     88:  ... loglikelihood=-50011.9822326526   0.9807357037771953
     89:  ... loglikelihood=-49955.92581691934  0.9807446580842618
     90:  ... loglikelihood=-49900.88484943885  0.9807527169606216
     91:  ... loglikelihood=-49846.83036430355  0.9807634621291014
     92:  ... loglikelihood=-49793.734544757914 0.9807724164361679
     93:  ... loglikelihood=-49741.57066440427  0.9807786844511144
     94:  ... loglikelihood=-49690.31303207665  0.9807840570353543
     95:  ... loglikelihood=-49639.93694007888  0.9807948022038341
     96:  ... loglikelihood=-49590.418615580194 0.9808001747880739
     97:  ... loglikelihood=-49541.73517492774  0.9808073382337271
     98:  ... loglikelihood=-49493.86458067577  0.9808145016793803
     99:  ... loglikelihood=-49446.785601155134 0.9808234559864467
    100:  ... loglikelihood=-49400.477772387036 0.9808359920163399
        model generated
            model building complete.... 
            annotated sentences: 20370
        Performing NER with new model
    
    
    It will do this for each iteration until you see:
    ......
     97:  ... loglikelihood=-49140.50129715517  0.9808462362240823
     98:  ... loglikelihood=-49095.42289306763  0.9808641444693966
     99:  ... loglikelihood=-49051.095083380205 0.9808713077675223
    100:  ... loglikelihood=-49007.49834809576  0.9808748894165852
        model generated
    

    You can change the number of iterations if you see that the annotated sentences and the knowns stop changing on subsequent runs as you refine the lists.
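
    One way to check that stopping condition without reading the log by eye is to pull the "knowns:" count printed at the start of each "ITERATION:" block out of the captured console output. The sketch below does that in plain Java. It assumes the console output was saved as a list of lines in the format shown above; the class and method names are hypothetical, and nothing here is part of the add-on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class KnownsTracker {
    // Matches lines such as "ITERATION: 0" (leading whitespace allowed).
    private static final Pattern ITERATION = Pattern.compile("^\\s*ITERATION:\\s*\\d+\\s*$");
    // Matches lines such as "        knowns: 625".
    private static final Pattern KNOWNS = Pattern.compile("^\\s*knowns:\\s*(\\d+)\\s*$");

    /** Returns the first "knowns:" count that follows each "ITERATION:" line. */
    static List<Integer> knownsPerIteration(List<String> consoleLines) {
        List<Integer> counts = new ArrayList<>();
        boolean awaitingCount = false;
        for (String line : consoleLines) {
            if (ITERATION.matcher(line).matches()) {
                awaitingCount = true;
                continue;
            }
            Matcher m = KNOWNS.matcher(line);
            if (awaitingCount && m.matches()) {
                counts.add(Integer.parseInt(m.group(1)));
                awaitingCount = false;
            }
        }
        return counts;
    }

    /** True when the last two iterations started with the same number of knowns. */
    static boolean hasStabilized(List<Integer> counts) {
        int n = counts.size();
        return n >= 2 && counts.get(n - 1).equals(counts.get(n - 2));
    }
}
```

    For the log above, the counts at the start of iterations 0 and 1 are 625 and 651, so the knowns are still growing at that point.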

    HTH

  • 2020-12-06 08:12

    There is no way to append to a model, unfortunately. But you can use the model to find what it can find, write the hits to a "known entities" file, and write the sentences out to a file as well. You can then add the names you know are not being recognized to the "known entities" file (and add more sentences they might appear in to the sentences file). Then you can use an OpenNLP add-on called modelbuilder-addon to build a new model from the sentences file and the "known entities" file.

    See this post for a code example:

    OpenNLP: foreign names does not get recognized

    It's a very new add-on; let me know how it works.
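
    As a rough illustration of the "add the names you know are not being recognized" step, the sketch below appends extra names to a known-entities file with one name per line, skipping blanks and names that are already listed. It uses only the Java standard library. The class and method names are hypothetical, and it assumes the existing file ends with a newline, as it would if it was written one name plus "\n" at a time. Skipping duplicates is only for tidiness here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class KnownEntitiesFile {
    /**
     * Appends each candidate name that is not already in the file
     * (one name per line) and returns the names that were added.
     */
    static List<String> appendMissing(Path knownsFile, List<String> candidates) throws IOException {
        Set<String> seen = new HashSet<>();
        if (Files.exists(knownsFile)) {
            for (String line : Files.readAllLines(knownsFile, StandardCharsets.UTF_8)) {
                String existing = line.trim();
                if (!existing.isEmpty()) {
                    seen.add(existing);
                }
            }
        }
        List<String> added = new ArrayList<>();
        for (String candidate : candidates) {
            String name = candidate.trim();
            // Set.add returns false for names already seen, so repeats are skipped
            if (!name.isEmpty() && seen.add(name)) {
                added.add(name);
            }
        }
        if (!added.isEmpty()) {
            // Assumes the existing file already ends with a line break
            Files.write(knownsFile, added, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        return added;
    }
}
```

    After extending the file this way, the model builder step can be run again against the updated list.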
