Question
I am following the documentation for Apache OpenNLP. I was able to understand sentence detection, the tokenizer, and the name finder, but I got stuck on the Categorizer: I cannot work out how to create a model for categorization.
I understand that I need to create a training file. The format is clear: each line holds a category, a space, and then a document. The file is saved with a .train extension.
So I created the following file:
Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?
I ran this command:
opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8
It starts doing something and then returns with an error. These are the contents in the command prompt:
Indexing events using cutoff of 5
Computing event counts... done. 2 events
Indexing... Dropped event Refund:[bow=What, bow=is, bow=the, bow=refund, bow=status, bow=for, bow=my, bow=order, bow=#342, bow=?]
Dropped event NewOffers:[bow=Are, bow=there, bow=any, bow=new, bow=offers, bow=for, bow=your, bow=products, bow=?]
done.
Sorting and merging events... Done indexing.
Incorporating indexed data for training...
Exception in thread "main" java.lang.NullPointerException
at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:263)
at opennlp.maxent.GIS.trainModel(GIS.java:256)
at opennlp.model.TrainUtil.train(TrainUtil.java:184)
at opennlp.tools.doccat.DocumentCategorizerME.train(DocumentCategorizerME.java:162)
at opennlp.tools.cmdline.doccat.DoccatTrainerTool.run(DoccatTrainerTool.java:61)
at opennlp.tools.cmdline.CLI.main(CLI.java:222)
I cannot figure out why this throws a NullPointerException. I also tried adding two more lines, but the result was the same:
Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ?
I found this blog, and it does pretty much the same thing. With the training file from that blog, training works like a charm. What is wrong with my file, and how do I resolve the error?
When I run opennlp DoccatTrainer with no arguments, it prints the help text, so the path is not the issue. Any help is appreciated.
EDIT: I changed the file to
Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ?
and it works. I thought it had something to do with the document (apparently it should contain two sentences), so I removed the last two lines to make it
Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products?
But then it fails again. So the question comes down to this: what kind of data, format, and documents does the trainer need?
Thanks
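Answer 1:
You need more training samples for each category. The default cutoff is 5, so features that occur fewer than 5 times in the training data are discarded. With only a few short lines, every feature is discarded and the events are dropped, as the "Dropped event" lines in your log show.
Please refer to this blog post: http://madhawagunasekara.blogspot.com/2014/11/nlp-categorizer.html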
Answer 2:
You can use the -cutoff flag in your DoccatTrainer command to change the default. In your case, add -cutoff 1 so that features seen only once are still kept:
opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8 -cutoff 1
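To see why some versions of the training file work and others fail, here is a rough Python sketch that mimics the cutoff check outside OpenNLP. It is not OpenNLP code. It assumes that the text after the category is split on whitespace into bag-of-words features (as the bow= entries in the log suggest), that occurrences are counted across the whole file, and that a feature is kept when its count is at least the cutoff. The function names are made up for illustration.

```python
from collections import Counter


def split_sample(line):
    """Split one training line into (category, tokens) on whitespace."""
    category, *tokens = line.split()
    return category, tokens


def surviving_features(lines, cutoff=5):
    """Tokens whose total count over all lines is at least `cutoff` (assumed rule)."""
    counts = Counter()
    for line in lines:
        if line.strip():  # ignore blank lines
            counts.update(split_sample(line)[1])
    return {tok for tok, n in counts.items() if n >= cutoff}


def dropped_samples(lines, cutoff=5):
    """Lines left with no feature at all once rare tokens are discarded."""
    kept = surviving_features(lines, cutoff)
    return [
        line for line in lines
        if line.strip() and not kept.intersection(split_sample(line)[1])
    ]


# The four-line training file from the question's edit.
four_lines = [
    "Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?",
    "NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products?",
    "Refund Can I place a refund request for electronics ?",
    "NewOffers Is there any new offer on buying worth 5000 ?",
]
# The two-line variant that fails again.
two_lines = four_lines[:2]

print("kept with 4 lines:", surviving_features(four_lines))
print("dropped with 2 lines:", len(dropped_samples(two_lines)))
```

Under these assumptions, the four-line file trains only because the standalone "?" token reaches five occurrences. Once the last two lines are removed, no token reaches the cutoff, so every sample is dropped. This matches the behaviour described in the question and shows that what matters is how often each feature occurs, not how many sentences each document has.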
Source: https://stackoverflow.com/questions/30168578/how-many-lines-and-documents-should-be-there-in-the-training-data-opennlp-catego