Question
I was running the gensim LdaMulticore package for topic modelling in Python. I tried to understand the meaning of the parameters within LdaMulticore and found a page that provides some explanation of how to use them. As a non-expert, I have some difficulty understanding these intuitively. I also referred to some other materials on the web, but I think this page gives relatively full explanations of every parameter.
1. chunksize
Number of documents to be used in each training chunk.
-> Does it mean that it determines how many documents are analyzed (trained) at once? Does changing the chunksize number generate significantly different outcomes, or does it only affect the running time?
2. alpha, eta, decay
-> I kept reading the explanations but couldn't understand these at all. Could someone give me an intuitive explanation of what these are about and when I need to adjust them?
3. iterations
Maximum number of iterations through the corpus when inferring the topic distribution of a corpus.
-> It seems that Python goes over the entire corpus n times when I set it to n. So the higher the number, the more data is analyzed, but it takes longer.
4. random_state
Either a randomState object or a seed to generate one. Useful for reproducibility.
-> I've seen people set this by putting in a random number. But what is random_state about?
Answer 1:
I am wondering if you saw this answer? There I provide some explanation regarding chunksize and alpha. This blog post has practical tips and can be of help too.
In short:
chunksize - how many documents are loaded into memory while calculating the "expectation" step before updating the model. Each "expectation" step of the Expectation Maximization algorithm takes this number of documents into account at once and updates the matrix only after it finishes the calculation on the "chunk". The size of the chunk determines the performance of the process: the more documents in memory at once, the better. Overly small chunks also impact numerical accuracy, particularly for a very large number of documents.
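As a minimal sketch of where chunksize goes (the toy documents and parameter values below are placeholders, not from the original question):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaMulticore

# Toy corpus: a few tokenized "documents" (placeholder data).
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
    ["graph", "trees", "minors", "survey"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# chunksize = how many documents are held in memory per "expectation" step;
# on real corpora a larger chunk generally means better throughput.
lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    chunksize=2,  # tiny only because the toy corpus is tiny
    workers=2,
)
```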
alpha, eta, decay - these are strictly linked to the LDA algorithm, and there are no "intuitive explanations" unless you have a grasp of the algorithm, which requires some understanding of Bayesian methods, Expectation Maximization in particular.
iterations - you are not correct. The higher the number, the more times the algorithm goes through the whole set of documents, so there is no "more data". It is only the corpus you provide, just iterated over more times.
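A short illustration of that point, again reusing the same toy corpus (the value is an arbitrary example):

```python
# A higher iterations value means the same corpus is iterated over more times
# during inference; it does not add any new data, it only costs more time.
lda = LdaMulticore(
    corpus=corpus,
    id2word=dictionary,
    num_topics=2,
    iterations=100,
)
```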
random_state - this serves as a seed: if you want to repeat the training process exactly, it is enough to set the seed to the same value and you will get the same model on the same data with the same other parameters. This is useful when you care about reproducibility.
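A minimal reproducibility sketch, assuming everything else about the two runs is identical (same corpus, same parameters):

```python
# Two runs with the same data, parameters, and seed are expected to yield the same topics.
lda_a = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2, random_state=42)
lda_b = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=2, random_state=42)

print(lda_a.print_topics())
print(lda_b.print_topics())  # expected to match the output above
```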
Source: https://stackoverflow.com/questions/65014553/how-to-tune-the-parameters-for-gensim-ldamulticore-in-python