Question
My question is about Latent Dirichlet Allocation (LDA). Suppose we fit an LDA model on our dataset and call fit_transform on it.
The output is a matrix with one row per document, here five documents, where each row holds that document's probabilities over three topics. The output is shown below:
[[ 0.0922935 0.09218227 0.81552423]
[ 0.81396651 0.09409428 0.09193921]
[ 0.05265482 0.05240119 0.89494398]
[ 0.05278187 0.89455775 0.05266038]
[ 0.85209554 0.07338382 0.07452064]]
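For reference, a matrix of this shape could be produced roughly as follows. This is only a minimal sketch assuming scikit-learn's LatentDirichletAllocation; the toy corpus, `random_state`, and variable names are made up for illustration, so the numbers will not match the ones above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: five short documents (not the original data).
docs = [
    "cats and dogs are pets",
    "stocks and bonds are investments",
    "dogs chase cats in the yard",
    "the team won the football match",
    "bonds pay interest to investors",
]

# LDA works on term counts, so build a document-term count matrix first.
counts = CountVectorizer().fit_transform(docs)

# Three topics, as in the question; random_state fixed only for repeatability.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topic = lda.fit_transform(counts)

print(doc_topic.shape)  # one row per document, one column per topic
```

Each row of `doc_topic` is that document's distribution over the three topics.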
This is the matrix that will be passed to a classification method for evaluation purposes.
For the classification step, we need a label for each row. But we do not have labels, which means I have to create them on my own.
One approach could be to take, for each document, the index of the topic with the highest probability as its label.
For example, the labels would then be:
[2, 0, 2, 1, 0]
However, this is a very simple example.
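The highest-probability rule can be sketched with NumPy as below. The array is copied from the output above; the names `doc_topic` and `labels` are only illustrative.

```python
import numpy as np

# Document-topic matrix from the question (5 documents x 3 topics).
doc_topic = np.array([
    [0.0922935, 0.09218227, 0.81552423],
    [0.81396651, 0.09409428, 0.09193921],
    [0.05265482, 0.05240119, 0.89494398],
    [0.05278187, 0.89455775, 0.05266038],
    [0.85209554, 0.07338382, 0.07452064],
])

# For each row (document), take the column index of the largest probability.
labels = doc_topic.argmax(axis=1)
print(labels.tolist())  # [2, 0, 2, 1, 0]
```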
I could also keep the two highest probabilities for each document, if each document is assumed to have only two topics. The example would then look like this:
[[ 0.0922935 0 0.81552423]
[ 0.81396651 0.09409428 0]
[ 0.05265482 0 0.89494398]
[ 0.05278187 0.89455775 0]
[ 0.85209554 0 0.07452064]]
As you can see, my rule is to keep the original probabilities for the topics with the highest values and set the remaining ones to zero.
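The keep-the-top-two rule could be written as the following sketch, again with NumPy. The value `k = 2` and the names `doc_topic`, `smallest`, and `top_k` are assumptions for illustration.

```python
import numpy as np

# Document-topic matrix from the question (5 documents x 3 topics).
doc_topic = np.array([
    [0.0922935, 0.09218227, 0.81552423],
    [0.81396651, 0.09409428, 0.09193921],
    [0.05265482, 0.05240119, 0.89494398],
    [0.05278187, 0.89455775, 0.05266038],
    [0.85209554, 0.07338382, 0.07452064],
])

k = 2  # number of topics to keep per document

# argsort is ascending, so all but the last k columns are the smallest entries.
smallest = np.argsort(doc_topic, axis=1)[:, :-k]

# Zero out the smallest entries on a copy, leaving the top-k values unchanged.
top_k = doc_topic.copy()
np.put_along_axis(top_k, smallest, 0.0, axis=1)
print(top_k)
```

Note that after masking, the rows no longer sum to 1.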
Which approach is correct? Has anyone used another approach that is more meaningful?
Many thanks in advance!
Source: https://stackoverflow.com/questions/45654463/concept-behind-the-transformed-data-of-lda-model