Is there a way to get the “original” text data for OpenNLP?

Submitted by 隐身守侯 on 2019-12-05 07:28:29

Question


I know that this question was asked before, but the answer was not satisfying (in the sense that the answer was just a link).

So my question is: is there any way to extend the existing OpenNLP models? I already know about the technique with DBpedia/Wikipedia. But what if I just want to append a few lines of text to improve the models - is there really no way? (If so, that would be really disappointing...)


Answer 1:


Unfortunately, you can't. See this question, which has a detailed answer to the same problem.

I think this is a tough problem, because when you deal with text you often run into licensing issues. For example, you cannot build a corpus from Twitter data and publish it to the community (see this paper for more information).

Therefore, companies often build domain-specific corpora and use them internally. We did the same in our research project: we built a tool (Quick Pad Tagger) to create annotated corpora efficiently (see here).




Answer 2:


OK, I think this needs a separate answer. I found the Yago database: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago//

This database seems fantastic (at first glance). You can download all of the tagged data and load it into a database (they already provide the tools for that).

The next stage is to "refactor" the tagged entities into the format OpenNLP expects (OpenNLP uses something like this: <START:person> Pierre Vinken <END>).
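The conversion step can be sketched roughly as follows. This is a minimal illustration, assuming you have already tokenized each sentence and extracted entity spans from your source data (the function name and the span representation are my own, not part of OpenNLP):

```python
def to_opennlp_format(tokens, spans):
    """Insert OpenNLP name-finder markers around entity token spans.

    tokens: list of token strings for one sentence
    spans:  list of (start, end, entity_type) tuples, end exclusive,
            assumed non-overlapping
    """
    opens = {start: etype for start, end, etype in spans}
    closes = {end for start, end, etype in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in closes:
            out.append("<END>")
        if i in opens:
            out.append(f"<START:{opens[i]}>")
        out.append(tok)
    # Close an entity that runs to the end of the sentence
    if len(tokens) in closes:
        out.append("<END>")
    return " ".join(out)

tokens = "Pierre Vinken , 61 years old".split()
line = to_opennlp_format(tokens, [(0, 2, "person")])
print(line)  # → <START:person> Pierre Vinken <END> , 61 years old
```

OpenNLP's trainer expects one such sentence per line, so you would write these lines out to a plain-text training file.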

Then you create some training text files and train a model with the training tool that ships with OpenNLP.
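For the training step, OpenNLP ships a command-line trainer for the name finder. A sketch of the invocation, assuming your converted sentences are in `train.txt` and you want an English person-name model (file names here are placeholders):

```shell
# Train a custom name-finder model from annotated sentences
opennlp TokenNameFinderTrainer \
  -lang en \
  -encoding UTF-8 \
  -data train.txt \
  -model en-ner-custom.bin
```

The resulting `.bin` file can then be loaded with `NameFinderME` in your Java code, just like the pre-built models.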

Not 100% sure whether this works, but I will come back and tell you.



Source: https://stackoverflow.com/questions/32668542/is-there-a-way-to-get-the-original-text-data-for-opennlp
