LDA topic modeling - Training and testing

死守一世寂寞 2021-01-30 09:18

I have read about LDA and I understand the mathematics of how the topics are generated when one inputs a collection of documents.

References say that LDA is an algorithm which

2 answers
  •  花落未央
    2021-01-30 09:24

    The general rule that evaluating on the training data invites overfitting also applies to unsupervised methods like LDA, though it is less obvious. LDA optimizes an objective, i.e. the generative probability of the documents, on the training data. Suppose that in the training data two words are indicative of a topic, say "white house" for US politics, and that the pair occurs only once. Then any algorithm that relies fully on the assumption that these words indicate politics and nothing else would score very well if you evaluated it on the training data. However, if there are other topics such as "architecture", you might question whether this is really the right thing to learn. A held-out test set can address this issue to some extent:

    1. Since the "white house" association is rare in the training data, it likely does not occur at all in the test data. If so, the evaluation shows how much your model relies on spurious associations that may not be helpful compared to more general ones.
    2. If "white house" does occur in the test data, say once in a document on US politics and once in a document on architecture, then the assumption that it only indicates "US politics" is too strong and the performance metrics will be worse, showing that your model is overfitting.
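The second case can be made concrete with a small sketch. Note that this is not a full LDA fit: it uses a smoothed mixture-of-unigrams model as a stand-in, and the topic-word distributions are estimated by counting over training documents whose topics are assumed known, standing in for what a fitted model would have learned. The corpus, the topic names, and the smoothing constant `BETA` are all hypothetical. The point is only to show training perplexity next to held-out perplexity when "white house" turns up in an architecture document.

```python
import math
from collections import Counter

# Hypothetical toy corpus. The topic grouping stands in for what a fitted
# topic model would have learned; real LDA would infer it without labels.
train = {
    "politics": ["white house president election senate",
                 "president senate vote election law"],
    "architecture": ["building roof design architect window",
                     "architect design building facade window"],
}
# Held-out documents: the first uses "white house" in an architecture context.
heldout = ["white house roof design architect",
           "president election vote senate law"]

BETA = 0.1  # assumed smoothing constant so unseen words keep nonzero probability

vocab = sorted(
    {w for docs in train.values() for d in docs for w in d.split()}
    | {w for d in heldout for w in d.split()}
)

def estimate_topic(docs):
    """Smoothed word distribution for one topic, estimated by counting."""
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts.values()) + BETA * len(vocab)
    return {w: (counts[w] + BETA) / total for w in vocab}

phi = {topic: estimate_topic(docs) for topic, docs in train.items()}

def doc_log_likelihood(doc):
    """log p(doc) under a uniform mixture over topics (mixture of unigrams)."""
    per_topic = [
        math.log(1.0 / len(phi)) + sum(math.log(p[w]) for w in doc.split())
        for p in phi.values()
    ]
    m = max(per_topic)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in per_topic))

def perplexity(docs):
    """exp of the negative average per-token log-likelihood."""
    n_tokens = sum(len(d.split()) for d in docs)
    return math.exp(-sum(doc_log_likelihood(d) for d in docs) / n_tokens)

train_docs = [d for docs in train.values() for d in docs]
train_ppl = perplexity(train_docs)
heldout_ppl = perplexity(heldout)

print(f"training perplexity: {train_ppl:.2f}")
print(f"held-out perplexity: {heldout_ppl:.2f}")
```

In this toy setup the held-out perplexity comes out higher than the training perplexity, and most of the gap comes from the architecture document containing "white house". With a real LDA implementation the same comparison applies; for example, scikit-learn's `LatentDirichletAllocation` has a `perplexity` method that can be called on a held-out document-term matrix.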
