Run cvb in mahout 0.8

送分小仙女□ 提交于 2019-12-20 10:57:13

问题


The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) version for Topic Modeling and removed the Latent Dirichlet Analysis (lda) approach, because cvb can be parallelized way better. Unfortunately there is only documentation for lda on how to run an example and generate meaningful output.

Thus, I want to:

  • preprocess some texts correctly
  • run the cvb0_local version of cvb
  • inspect the results by looking at the top n words in each of the generated topics

回答1:


So here are the subsequent Mahout commands I had to call in a linux shell to do it. $MAHOUT_HOME points to my mahout/bin folder.

$MAHOUT_HOME/mahout seqdirectory \
    -i path/to/directory/with/texts \
    -o out/sequenced

$MAHOUT_HOME/mahout seq2sparse -i out/sequenced \
    -o out/sparseVectors \
    --namedVector \
    -wt tf

$MAHOUT_HOME/mahout rowid \
    -i out/sparseVectors/tf-vectors/ \
    -o out/matrix

$MAHOUT_HOME/mahout cvb0_local \
    -i out/matrix/matrix \
    -d out/sparseVectors/dictionary.file-0 \
    -a 0.5 \
    -top 4 -do out/cvb/do_out \
    -to out/cvb/to_out

Inspect the output by showing the top 10 words of each topic:

$MAHOUT_HOME/mahout vectordump \
    -i out/cvb/to_out \
    --dictionary out/sparseVectors/dictionary.file-0 \
    --dictionaryType sequencefile \
    --vectorSize 10 \
    -sort out/cvb/to_out



回答2:


Thanks to JoKnopp for the detail commands.

If you get: Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String

you need to add the command line option "maxIterations": --maxIterations (-m) maxIterations

I use -m 20 and it works

refer to: https://issues.apache.org/jira/browse/MAHOUT-1141



来源:https://stackoverflow.com/questions/14757162/run-cvb-in-mahout-0-8

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!