The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) version for Topic Modeling and removed the Latent Dirichlet Analysis (lda) approach, because cvb can
So here are the subsequent Mahout commands I had to call in a linux shell to do it. $MAHOUT_HOME points to my mahout/bin folder.
$MAHOUT_HOME/mahout seqdirectory \
-i path/to/directory/with/texts \
-o out/sequenced
$MAHOUT_HOME/mahout seq2sparse -i out/sequenced \
-o out/sparseVectors \
--namedVector \
-wt tf
$MAHOUT_HOME/mahout rowid \
-i out/sparseVectors/tf-vectors/ \
-o out/matrix
$MAHOUT_HOME/mahout cvb0_local \
-i out/matrix/matrix \
-d out/sparseVectors/dictionary.file-0 \
-a 0.5 \
-top 4 -do out/cvb/do_out \
-to out/cvb/to_out
Inspect the output by showing the top 10 words of each topic:
$MAHOUT_HOME/mahout vectordump \
-i out/cvb/to_out \
--dictionary out/sparseVectors/dictionary.file-0 \
--dictionaryType sequencefile \
--vectorSize 10 \
-sort out/cvb/to_out
Thanks to JoKnopp for the detail commands.
If you get: Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
you need to add the command line option "maxIterations": --maxIterations (-m) maxIterations
I use -m 20 and it works
refer to: https://issues.apache.org/jira/browse/MAHOUT-1141