Question
What is the meaning of cut-off and iteration for training in OpenNLP, or, for that matter, in natural language processing generally? I just need a layman's explanation of these terms. As far as I understand, iteration is the number of times the algorithm is repeated, and cut-off is a value such that if a text scores above this cut-off for some specific category, it gets mapped to that category. Am I right?
Answer 1:
Correct, the term iteration refers to the general notion of iterative algorithms, where one sets out to solve a problem by successively producing (hopefully increasingly more accurate) approximations of some "ideal" solution. Generally speaking, the more iterations, the more accurate ("better") the result will be, but of course the more computational steps have to be carried out.
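As a toy illustration of the iteration idea (this is not OpenNLP's trainer), the sketch below uses Newton's method to approximate a square root. The function name `approximate_sqrt` and the crude starting guess are assumptions made purely for this example.

```python
import math


def approximate_sqrt(value, iterations):
    """Approximate sqrt(value) using a fixed number of Newton iterations."""
    guess = value  # deliberately crude starting point (assumption of this sketch)
    for _ in range(iterations):
        # Each pass refines the previous approximation.
        guess = 0.5 * (guess + value / guess)
    return guess


# Absolute error after 1, 3 and 6 iterations when approximating sqrt(2).
errors = [abs(approximate_sqrt(2.0, n) - math.sqrt(2.0)) for n in (1, 3, 6)]
print(errors)
```

The error shrinks as the iteration count grows, at the price of more computation per call. After a handful of iterations the result barely changes, which is why iteration counts are usually capped.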
The term cutoff (aka cutoff frequency) is used to designate a method of reducing the size of n-gram language models (as used by OpenNLP, e.g. its part-of-speech tagger). Consider the following example:
Sentence 1 = "The cat likes mice."
Sentence 2 = "The cat likes fish."
Bigram model = {"the cat" : 2, "cat likes" : 2, "likes mice" : 1, "likes fish" : 1}
If you set the cutoff frequency to 1 for this example, so that n-grams occurring no more than once are dropped, the n-gram model would be reduced to
Bigram model = {"the cat" : 2, "cat likes" : 2}
That is, the cutoff method removes from the language model those n-grams that occur infrequently in the training data. Reducing the size of n-gram language models is sometimes necessary, as the number of bigrams alone (let alone trigrams, 4-grams, etc.) explodes for larger corpora. The remaining information (n-gram counts) can then be used to statistically estimate the probability of a word (or its POS tag) given the (n-1) previous words (or POS tags).
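The two-sentence example can be reproduced with a minimal sketch. The helper names `bigram_counts` and `apply_cutoff` are hypothetical, and the convention that an n-gram must occur more than `cutoff` times to survive is chosen to match the example above; libraries differ on whether a count equal to the threshold is kept.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count lower-cased word bigrams across all sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        # Pair each word with its successor to form bigrams.
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))
    return counts


def apply_cutoff(counts, cutoff):
    """Keep only n-grams seen more than `cutoff` times (convention of this sketch)."""
    return {ngram: n for ngram, n in counts.items() if n > cutoff}


sentences = ["The cat likes mice.", "The cat likes fish."]
full_model = bigram_counts(sentences)
reduced_model = apply_cutoff(full_model, cutoff=1)
print(dict(full_model))
print(reduced_model)
```

With `cutoff=1`, only "the cat" and "cat likes" remain, each with a count of 2.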
Answer 2:
In the context of the Apache OpenNLP library, we can take document categorization of review comments as a specific example, with training data as given here.
positive I love this. I like this. I really love this product. We like this.
negative I hate this. I dislike this. We absolutely hate this. I really hate this product.
The cut-off value is used to exclude words whose counts are less than the cut-off from being used as features. If the cut-off were more than 2, the word “love” might not be considered a feature, and we might get wrong results. Generally, the cut-off value is useful for avoiding unnecessary features for words that rarely occur. A detailed example with further explanation can be found here in this article.
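To make the effect on the review data concrete, here is a rough sketch in plain Python rather than the OpenNLP API. It treats every word as a candidate feature and keeps those seen at least `cutoff` times; the helper names `word_counts` and `features_with_cutoff` are hypothetical, and the simple whitespace tokenization is an assumption.

```python
from collections import Counter

# The two training lines from the example: (category, text).
training_data = [
    ("positive", "I love this. I like this. I really love this product. We like this."),
    ("negative", "I hate this. I dislike this. We absolutely hate this. I really hate this product."),
]


def word_counts(samples):
    """Count lower-cased words over all training texts, ignoring full stops."""
    counts = Counter()
    for _category, text in samples:
        counts.update(text.lower().replace(".", " ").split())
    return counts


def features_with_cutoff(counts, cutoff):
    """Keep words seen at least `cutoff` times (convention of this sketch)."""
    return {word for word, n in counts.items() if n >= cutoff}


counts = word_counts(training_data)
print(counts["love"], counts["hate"])
print("love" in features_with_cutoff(counts, 2))
print("love" in features_with_cutoff(counts, 3))
```

Here "love" occurs twice and "hate" three times, so raising the cut-off from 2 to 3 drops "love" as a feature while "hate" survives, which would skew the classifier toward the negative category.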
Source: https://stackoverflow.com/questions/30238014/what-is-the-meaning-of-cut-off-and-iteration-for-trainings-in-opennlp