What is the meaning of 'cut-off' and 'iteration' for trainings in OpenNLP?

Submitted by 泪湿孤枕 on 2019-12-08 16:53:38

Question


What is the meaning of cut-off and iteration for training in OpenNLP, or, for that matter, in natural language processing? I just need a layman's explanation of these terms. As far as I understand, iteration is the number of times the algorithm is repeated, and cut-off is a value such that if a text scores above it for some specific category, the text gets mapped to that category. Am I right?


Answer 1:


Correct, the term iteration refers to the general notion of iterative algorithms, where one sets out to solve a problem by successively producing (hopefully increasingly accurate) approximations of some "ideal" solution. Generally speaking, the more iterations, the more accurate ("better") the result will be, but the more computational steps have to be carried out.
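To make the idea of iteration concrete, here is a minimal sketch. It is not OpenNLP's training algorithm; it uses Newton's method for a square root purely as an illustration, and the function name `approximate_sqrt` is made up for this example. It shows how each additional iteration moves the estimate closer to the "ideal" solution.

```python
def approximate_sqrt(value, iterations):
    """Approximate sqrt(value) with Newton's method for a fixed number of iterations."""
    estimate = value  # crude starting guess
    for _ in range(iterations):
        # Each pass refines the previous approximation.
        estimate = 0.5 * (estimate + value / estimate)
    return estimate


# Error of the approximation of sqrt(2) after 1, 2, 3 and 5 iterations.
errors = [abs(approximate_sqrt(2.0, n) - 2.0 ** 0.5) for n in (1, 2, 3, 5)]
for n, error in zip((1, 2, 3, 5), errors):
    print(f"{n} iteration(s): error = {error:.2e}")
```

The error shrinks with every extra iteration, at the cost of more computation. The iteration count in a trainer is the same kind of trade-off.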

The term cutoff (aka cutoff frequency) is used to designate a method of reducing the size of n-gram language models (as used by OpenNLP, e.g. its part-of-speech tagger). Consider the following example:

Sentence 1 = "The cat likes mice."
Sentence 2 = "The cat likes fish."
Bigram model = {"the cat" : 2, "cat likes" : 2, "likes mice" : 1, "likes fish" : 1}

If you set the cutoff frequency to 1 for this example, so that every n-gram occurring no more than once is dropped, the n-gram model would be reduced to

Bigram model = {"the cat" : 2, "cat likes" : 2}

That is, the cutoff method removes from the language model those n-grams that occur infrequently in the training data. Reducing the size of n-gram language models is sometimes necessary, as the number of even bigrams (let alone trigrams, 4-grams, etc.) explodes for larger corpora. The remaining information (n-gram counts) can then be used to statistically estimate the probability of a word (or its POS tag) given the (n-1) previous words (or POS tags).
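The bigram example above can be reproduced with a short sketch. The helper names `bigram_counts` and `apply_cutoff` are hypothetical and are not part of OpenNLP. The sketch follows the convention of this example, where an n-gram is kept only if it occurs more than `cutoff` times; a given library may draw the boundary differently, for example keeping counts equal to the cutoff.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count lowercase word bigrams over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().rstrip(".").split()
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return counts


def apply_cutoff(counts, cutoff):
    """Keep only the n-grams seen more than `cutoff` times (convention of the example)."""
    return {ngram: n for ngram, n in counts.items() if n > cutoff}


sentences = ["The cat likes mice.", "The cat likes fish."]
model = bigram_counts(sentences)
reduced = apply_cutoff(model, cutoff=1)
print(dict(model))
print(reduced)
```

With `cutoff=1`, the two bigrams that occur only once ("likes mice" and "likes fish") are dropped, leaving "the cat" and "cat likes".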




Answer 2:


In the context of the Apache OpenNLP library, we can take document categorization of review comments as a specific example, using training data like the following.

positive     I love this. I like this. I really love this product. We like this.
negative     I hate this. I dislike this. We absolutely hate this. I really hate this product.

The cut-off value is used to exclude, as features, words whose counts are less than the cut-off. If the cut-off were more than 2, the word "love" might not be considered as a feature, and we might get wrong results. Generally, the cut-off value is useful to avoid creating unnecessary features for words that rarely occur. A detailed example with further explanation can be found in this article.
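The effect on the review data above can be sketched as follows. This is a simplified illustration and not OpenNLP's actual feature extraction: it assumes plain bag-of-words features counted over the whole training set, and the function name `word_features` is hypothetical. A word is kept only if its count is at least the cut-off.

```python
from collections import Counter
import re

training_data = [
    ("positive", "I love this. I like this. I really love this product. We like this."),
    ("negative", "I hate this. I dislike this. We absolutely hate this. I really hate this product."),
]


def word_features(samples, cutoff):
    """Return the words whose total count is at least `cutoff` (assumed bag-of-words features)."""
    counts = Counter()
    for _category, text in samples:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return {word for word, n in counts.items() if n >= cutoff}


print(sorted(word_features(training_data, cutoff=2)))
print(sorted(word_features(training_data, cutoff=3)))
```

Here "love" occurs twice, so it survives a cut-off of 2 but is dropped at a cut-off of 3, while "hate" (three occurrences) is still kept. On such a tiny data set, a high cut-off throws away exactly the words that distinguish the categories.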



Source: https://stackoverflow.com/questions/30238014/what-is-the-meaning-of-cut-off-and-iteration-for-trainings-in-opennlp
