Is there a way to get the “original” text data for OpenNLP?

Submitted by 隐身守侯 on 2019-12-05 07:28:29

Question


I know that this question was asked before, but the answer was not satisfying (in the sense that the answer was just a link).

So my question is: is there any way to extend the existing OpenNLP models? I already know about the technique with DBpedia/Wikipedia. But what if I just want to append a few lines of text to improve the models - is there really no way? (If so, that would be really disappointing...)


Answer 1:


Unfortunately, you can't. See this question, which has a detailed answer to the same problem.

I think this is a tough problem, because when you deal with text you often run into licensing issues. For example, you cannot build a corpus from Twitter data and publish it to the community (see this paper for more information).

Therefore, companies often build domain-specific corpora and use them internally. We did the same in our research project: we built a tool (Quick Pad Tagger) to create annotated corpora efficiently (see here).




Answer 2:


OK, I think this needs a separate answer. I found the Yago database: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago//

This database seems fantastic (at first glance). You can download all of the tagged data and load it into a database (they already provide the tools for that).

The next stage is to "refactor" the tagged entities into the format OpenNLP expects (OpenNLP uses something like this: <START:person> Pierre Vinken <END>).
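The conversion step can be sketched roughly as follows. This is a minimal illustration, assuming you have already tokenized each sentence and extracted entity spans from your source data (the function name and the span representation are my own, not part of OpenNLP):

```python
def to_opennlp_format(tokens, spans):
    """Insert OpenNLP name-finder markers around entity token spans.

    tokens: list of token strings for one sentence
    spans:  list of (start, end, entity_type) tuples, end exclusive,
            assumed non-overlapping
    """
    opens = {start: etype for start, end, etype in spans}
    closes = {end for start, end, etype in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in closes:
            out.append("<END>")
        if i in opens:
            out.append(f"<START:{opens[i]}>")
        out.append(tok)
    # Close an entity that runs to the end of the sentence
    if len(tokens) in closes:
        out.append("<END>")
    return " ".join(out)

tokens = "Pierre Vinken , 61 years old".split()
line = to_opennlp_format(tokens, [(0, 2, "person")])
print(line)  # → <START:person> Pierre Vinken <END> , 61 years old
```

OpenNLP's trainer expects one such sentence per line, so you would write these lines out to a plain-text training file.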

Then you create some training text files and train a model with the training tool that ships with OpenNLP.
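For the training step, OpenNLP ships a command-line trainer for the name finder. A sketch of the invocation, assuming your converted sentences are in `train.txt` and you want an English person-name model (file names here are placeholders):

```shell
# Train a custom name-finder model from annotated sentences
opennlp TokenNameFinderTrainer \
  -lang en \
  -encoding UTF-8 \
  -data train.txt \
  -model en-ner-custom.bin
```

The resulting `.bin` file can then be loaded with `NameFinderME` in your Java code, just like the pre-built models.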

Not 100% sure whether this works, but I will come back and tell you.



Source: https://stackoverflow.com/questions/32668542/is-there-a-way-to-get-the-original-text-data-for-opennlp
