Question
Is there a way to train the existing Apache OpenNLP POS Tagger model? I need to add a few more proper nouns to the model that are specific to my application. When I try to use the below command:
opennlp POSTaggerTrainer -type maxent -model en-pos-maxent.bin \
-lang en -data en-pos.train -encoding UTF-8
the entire model is retrained. I'd only like to append a few new sentences to en-pos-maxent.bin.
This is how my training file looks:
Where_WRB is_VBZ the_DT Seven_DNNP Dwarfs_DNNP Mine_DNNP Train_DNNP ?_?
Where_WRB is_VBZ the_DT Astro_DNNP Orbiter_DNNP ?_?
Where_WRB is_VBZ the_DT Barnstormer_DNNP ?_?
Where_WRB is_VBZ the_DT Big_DNNP Thunder_DNNP Mountain_DNNP Railroad_DNNP ?_?
Where_WRB is_VBZ the_DT Buzz_DNNP Lightyears_DNNP Space_DNNP Ranger_DNNP Spin_DNNP ?_?
Where_WRB is_VBZ the_DT Casey_DNNP Jr_DNNP Splash_DNNP N_DNNP Soak_DNNP Station_DNNP ?_?
Where_WRB is_VBZ the_DT Cinderella_DNNP Castle_DNNP ?_?
Where_WRB is_VBZ the_DT Country_DNNP Bear_DNNP Jamboree_DNNP ?_?
Where_WRB is_VBZ the_DT Dumbo_DNNP the_DNNP Flying_DNNP Elephant_DNNP ?_?
Where_WRB is_VBZ the_DT Enchanted_DNNP Tales_DNNP with_DNNP Belle_DNNP ?_?
Where_WRB is_VBZ the_DT Frontierland_DNNP Shootin_DNNP Arcade_DNNP ?_?
After training the model, all words except those in the training file are tagged as DNNP. For example, if I ask for the word 'Where' (present in the training file) to be tagged, the answer is WRB, but if I ask for the word 'hello' (not present in the training file) to be tagged, it is tagged as DNNP. So I want to add a few words to the existing model. How can I do that?
Answer 1:
Unfortunately, you can't simply augment OpenNLP models with additional training instances. To get the model you want, you'd need to retrain the entire model on an existing (large) POS-tagged corpus plus your new examples.
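As a rough illustration of the "original data plus new data" step, the sketch below merges a base corpus with the new sentences and checks that every token follows the word_TAG format shown in the question. The helper names (`parse_tagged_sentence`, `merge_corpora`, `tag_counts`) and the one-sentence base corpus are hypothetical stand-ins, not part of OpenNLP.

```python
from collections import Counter


def parse_tagged_sentence(line):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST underscore so words containing '_' still parse.
        word, sep, tag = token.rpartition("_")
        if not sep or not word or not tag:
            raise ValueError(f"malformed token: {token!r}")
        pairs.append((word, tag))
    return pairs


def merge_corpora(base_lines, extra_lines):
    """Concatenate two corpora, skipping blank lines and validating each sentence."""
    merged = []
    for line in list(base_lines) + list(extra_lines):
        line = line.strip()
        if line:
            parse_tagged_sentence(line)  # raises ValueError on malformed input
            merged.append(line)
    return merged


def tag_counts(lines):
    """Count how often each tag occurs, to spot a skewed tag distribution."""
    counts = Counter()
    for line in lines:
        counts.update(tag for _, tag in parse_tagged_sentence(line))
    return counts


if __name__ == "__main__":
    # Hypothetical stand-in for a large POS-tagged corpus.
    base = ["The_DT cat_NN sat_VBD ._."]
    extra = ["Where_WRB is_VBZ the_DT Barnstormer_DNNP ?_?"]
    combined = merge_corpora(base, extra)
    print("\n".join(combined))
    print(tag_counts(combined))
```

The merged lines could then be written to a single file and passed via -data to the same POSTaggerTrainer command used in the question. Checking the tag counts first helps confirm that the new DNNP examples do not dominate the combined data.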
If you just want to identify certain kinds of proper nouns, you could consider training an OpenNLP NameFinder (or other named entity extractor) with your data instead, since that kind of annotator is better suited to identifying particular types of proper nouns. You only give a few examples above, but I think a POS tagger will have trouble distinguishing normal NNPs from your new DNNPs, because they appear in the same contexts as NNPs and have the same form (capitalized noun phrases). A named entity recognizer is a better tool for such a task.
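To make the NameFinder suggestion concrete, here is a minimal sketch that converts the POS-style lines from the question into the span-annotated format the OpenNLP name finder trainer reads, in which each entity is wrapped in `<START:type> ... <END>`. The function name `to_namefinder_line` and the entity type `attraction` are assumptions chosen for illustration.

```python
def to_namefinder_line(tagged_line, entity_tag="DNNP", entity_type="attraction"):
    """Convert 'word_TAG ...' into NameFinder-style text.

    Consecutive tokens tagged `entity_tag` are wrapped in one
    <START:entity_type> ... <END> span; all other tags are dropped.
    """
    out = []
    inside = False  # True while we are within an entity span
    for token in tagged_line.split():
        word, _, tag = token.rpartition("_")
        if tag == entity_tag and not inside:
            out.append(f"<START:{entity_type}>")
            inside = True
        elif tag != entity_tag and inside:
            out.append("<END>")
            inside = False
        out.append(word)
    if inside:  # close a span that runs to the end of the sentence
        out.append("<END>")
    return " ".join(out)


if __name__ == "__main__":
    print(to_namefinder_line("Where_WRB is_VBZ the_DT Astro_DNNP Orbiter_DNNP ?_?"))
    # prints: Where is the <START:attraction> Astro Orbiter <END> ?
```

A name finder trained on such lines only has to learn where attraction names start and end, rather than compete with the tagger's existing NNP class.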
Answer 2:
Although this was posted a while ago, I can offer an answer: the YAGO database.
I answered my own question here: Is there a way to get the "original" text data for OpenNLP?
Have a look at it.
Source: https://stackoverflow.com/questions/27301800/is-it-possible-to-append-words-to-an-existing-opennlp-pos-corpus-model