How to tune the parameters for gensim `LdaMulticore` in Python


Question


I was running the gensim LdaMulticore package for topic modelling in Python. I tried to understand the meaning of the parameters within LdaMulticore and found a website that provides some explanation of how to use them. As a non-expert, I have some difficulty understanding these intuitively. I also referred to some other materials from the web, but I think this page gives relatively full explanations of every parameter:
This page

1. chunksize: Number of documents to be used in each training chunk.
   -> Does it mean that it determines how many documents are analyzed (trained) at once?
   Does changing the chunksize produce significantly different outcomes, or does it only affect the running time?

2. alpha, eta, decay
   -> I kept reading the explanations but couldn't understand these at all.
   Could someone give me an intuitive explanation of what these are about and when I would need to adjust them?

3. iterations: Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.
   -> It seems that Python goes over the entire corpus n times when I set it to n. So the higher the number, the more data is analyzed, but it takes a longer time.

4. random_state: Either a randomState object or a seed to generate one. Useful for reproducibility.
   -> I've seen people set this by putting in a random number. But what is random_state about?
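For reference, all of the parameters above are arguments of the LdaMulticore constructor. Here is a minimal sketch of where they go; the tiny corpus and the values are made up purely for illustration, not recommendations:

    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    # A tiny made-up corpus, just so the call below runs end to end.
    texts = [
        ["topic", "modelling", "with", "gensim"],
        ["tuning", "lda", "parameters"],
        ["gensim", "lda", "topic", "parameters"],
    ]
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(text) for text in texts]

    model = LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,
        chunksize=2000,     # documents per training chunk
        alpha="symmetric",  # prior on per-document topic distributions
        eta=None,           # prior on per-topic word distributions
        decay=0.5,          # how fast older statistics are forgotten
        iterations=50,      # max iterations when inferring topic distributions
        random_state=100,   # seed for reproducibility
    )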


Answer 1:


I am wondering if you saw this answer? There I provide some explanation regarding chunksize and alpha. This blog post has practical tips and can be of help too.

In short: chunksize - how many documents are loaded into memory while calculating the "expectation" step before updating the model. Each "expectation" step of the Expectation-Maximization algorithm takes this number of documents into account at once and updates the matrix only after it finishes the calculation on the chunk. The size of the chunk determines the performance of the process: the more documents in memory at once, the better. Overly small chunks also affect numerical accuracy, particularly for a very large number of documents.
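One way to get a feel for this in practice is to time the same training run with different chunk sizes. A rough sketch, with a made-up corpus and values chosen only for illustration:

    import time

    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    # A small made-up corpus, repeated so the timing difference is visible.
    texts = [["topic", "modelling"], ["tuning", "lda"], ["gensim", "lda", "topic"]] * 200
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # Same data, same seed; only chunksize varies.
    for chunk in (50, 600):
        start = time.time()
        LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2,
                     chunksize=chunk, random_state=100, workers=2)
        print(f"chunksize={chunk}: trained in {time.time() - start:.2f}s")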

alpha, eta, decay - these are strictly linked to the LDA algorithm, and there are no "intuitive explanations" unless you have a grasp of the algorithm, which requires some understanding of Bayesian methods, Expectation Maximization in particular.
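For completeness, here is where they enter the API. Roughly speaking (and this is a simplification): alpha is the Dirichlet prior on per-document topic distributions, eta is the Dirichlet prior on per-topic word distributions, and decay controls how quickly statistics from earlier chunks are down-weighted during online updates. A sketch reusing the corpus and dictionary from the snippet above; the values are illustrative, not recommendations:

    from gensim.models import LdaMulticore

    # Assumes `corpus` and `dictionary` were built as in the previous sketch.
    model = LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=2,
        alpha="asymmetric",  # Dirichlet prior over per-document topic weights
        eta=0.01,            # Dirichlet prior over per-topic word weights
        decay=0.7,           # how strongly new chunks outweigh old statistics
    )

As far as I know, alpha="auto" (learning the prior from the data) is available in LdaModel but not in LdaMulticore.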

iterations - you are not correct. The higher the number, the more times the algorithm goes through the whole set of documents, so there is no "more data". It is only the corpus you provide, iterated over more times.
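A minimal sketch of that point, again reusing the corpus and dictionary from the first snippet: the data passed in is identical in both runs, and only the amount of iteration over it changes.

    from gensim.models import LdaMulticore

    # Same corpus both times -- no extra data, only more iteration over it.
    for n_iter in (20, 200):
        model = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2,
                             iterations=n_iter, random_state=100)
        print(f"iterations={n_iter}: trained on {len(corpus)} documents")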

random_state - this serves as a seed: if you want to repeat the training process exactly, it is enough to set the seed to the same value, and you will receive the same model on the same data with the same other parameters. This is useful when you care about reproducibility.
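A quick sanity check of that claim, reusing the corpus and dictionary from the first snippet (workers=1 here, since with several worker processes the exact dispatch of chunks can also come into play):

    from gensim.models import LdaMulticore

    # Two runs with the same seed, data, and parameters.
    m1 = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2,
                      random_state=100, workers=1)
    m2 = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2,
                      random_state=100, workers=1)
    print(m1.show_topics() == m2.show_topics())  # expected: True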



Source: https://stackoverflow.com/questions/65014553/how-to-tune-the-parameters-for-gensim-ldamulticore-in-python
