How many lines and documents should there be in the training data for the OpenNLP categorizer?

Submitted by 无人久伴 on 2019-12-14 03:59:13

Question


I am following the documentation for Apache OpenNLP. I was able to understand sentence detection, the tokenizer, and the name finder, but I got stuck on the Categorizer: I cannot work out how to create a model for categorization.

I do understand that I need to create a training file. The format is clear: each line holds a category, a space, and then a document. The file is saved with a .train extension.

So I created the following file:

Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?

I ran this command:

opennlp DoccatTrainer -model en-doccat.bin -lang en -data en-doccat.train -encoding UTF-8

It starts doing something and then returns with an error. These are the contents in the command prompt:

Indexing events using cutoff of 5

    Computing event counts...  done. 2 events
    Indexing...  Dropped event Refund:[bow=What, bow=is, bow=the, bow=refund, bow=status, bow=for, bow=my, bow=order, bow=#342, bow=?]
Dropped event NewOffers:[bow=Are, bow=there, bow=any, bow=new, bow=offers, bow=for, bow=your, bow=products, bow=?]
done.
Sorting and merging events... Done indexing.
Incorporating indexed data for training...  
Exception in thread "main" java.lang.NullPointerException
    at opennlp.maxent.GISTrainer.trainModel(GISTrainer.java:263)
    at opennlp.maxent.GIS.trainModel(GIS.java:256)
    at opennlp.model.TrainUtil.train(TrainUtil.java:184)
    at opennlp.tools.doccat.DocumentCategorizerME.train(DocumentCategorizerME.java:162)
    at opennlp.tools.cmdline.doccat.DoccatTrainerTool.run(DoccatTrainerTool.java:61)
    at opennlp.tools.cmdline.CLI.main(CLI.java:222)

I cannot figure out why this throws a NullPointerException. I also tried adding two more lines, but got the same result.

Refund What is the refund status for my order #342 ?
NewOffers Are there any new offers for your products ?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ?  

I found this blog, but it does pretty much the same thing. When I try its training file, it works like a charm. What is wrong with my file? How do I resolve the error?

When I run opennlp DoccatTrainer on its own it prints the help text, so the path is not the issue. Any help is appreciated.

EDIT: I changed the file to

Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products?
Refund Can I place a refund request for electronics ?
NewOffers Is there any new offer on buying worth 5000 ? 

and it works. I thought this had something to do with the document (apparently it should contain two sentences), so I removed the last two lines to make it:

Refund What is the refund status for my order #342 ? Can I place a refund request for clothes ?
NewOffers Are there any new offers for your products ? what are the offers on new products or new offers on old products? 

But then it fails again. So the question now comes down to this: what kind of data, format, or document does the trainer need?

Thanks


Answer 1:


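You need more training data. The default cutoff is 5 (see "Indexing events using cutoff of 5" in your output), so features that occur fewer than 5 times in the training data are discarded. With only a few short lines, every feature falls below the cutoff, all events are dropped (the "Dropped event" lines), and training fails. As a rule of thumb, add more than 5 samples for each category.

Please refer to this blog post: http://madhawagunasekara.blogspot.com/2014/11/nlp-categorizer.html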
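
To see why the two-line file fails, the effect of the cutoff can be approximated in a few lines of Java. This is only a sketch under stated assumptions: it assumes the cutoff applies to how often each bag-of-words token occurs (as the bow= entries and "Dropped event" lines in the question's log suggest), it splits on whitespace, and the names CutoffCheck, countTokens and droppedLines are made up for illustration and are not part of OpenNLP. The real indexer may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an OpenNLP class: approximates the cutoff check.
class CutoffCheck {

    /** Counts how often each whitespace-separated token occurs across all documents. */
    static Map<String, Integer> countTokens(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            // parts[0] is the category; the remaining parts are the document tokens.
            for (int i = 1; i < parts.length; i++) {
                counts.merge(parts[i], 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the lines in which no token reaches the cutoff (assumed to be dropped). */
    static List<String> droppedLines(List<String> lines, int cutoff) {
        Map<String, Integer> counts = countTokens(lines);
        List<String> dropped = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            boolean keepsAny = false;
            for (int i = 1; i < parts.length; i++) {
                if (counts.get(parts[i]) >= cutoff) {
                    keepsAny = true;
                    break;
                }
            }
            if (!keepsAny) {
                dropped.add(line);
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        // The two-line training file from the question.
        List<String> lines = Arrays.asList(
            "Refund What is the refund status for my order #342 ?",
            "NewOffers Are there any new offers for your products ?");
        System.out.println("dropped at cutoff 5: " + droppedLines(lines, 5).size());
        System.out.println("dropped at cutoff 1: " + droppedLines(lines, 1).size());
    }
}
```

For the two-line file, no token occurs five times, so under these assumptions both lines are dropped at the default cutoff and nothing is left to train on. Lowering the cutoff or adding more samples changes that.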




Answer 2:


You can use the -cutoff flag in your DoccatTrainer command to change the default of 5. In your case, you would add -cutoff 1, so that a feature only has to occur once to be kept and the events from a small training file are no longer dropped.



Source: https://stackoverflow.com/questions/30168578/how-many-lines-and-documents-should-be-there-in-the-training-data-opennlp-catego
