Run cvb in mahout 0.8

谎友^ 2021-02-04 20:22

The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) implementation for topic modeling and has removed the Latent Dirichlet Allocation (lda) approach, because cvb can

2 Answers
  • 2021-02-04 21:08

    Here is the sequence of Mahout commands I had to run in a Linux shell to do it. $MAHOUT_HOME points to my mahout/bin folder.

    $MAHOUT_HOME/mahout seqdirectory \
        -i path/to/directory/with/texts \
        -o out/sequenced
    
    $MAHOUT_HOME/mahout seq2sparse -i out/sequenced \
        -o out/sparseVectors \
        --namedVector \
        -wt tf
    
    $MAHOUT_HOME/mahout rowid \
        -i out/sparseVectors/tf-vectors/ \
        -o out/matrix
    
    $MAHOUT_HOME/mahout cvb0_local \
        -i out/matrix/matrix \
        -d out/sparseVectors/dictionary.file-0 \
        -a 0.5 \
        -top 4 \
        -do out/cvb/do_out \
        -to out/cvb/to_out
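
Each command above reads what the previous one wrote, so a missing intermediate output usually only shows up as an error one step later. The following is a minimal sketch of a check for that, assuming the outputs live on the local filesystem rather than on HDFS. The function name `missing_outputs` is made up for this illustration; the paths are the ones used in the commands above.

```shell
# Hypothetical helper: print every intermediate path that does not exist yet.
# Assumes the pipeline wrote to the local filesystem (not HDFS).
missing_outputs() {
    for p in \
        out/sequenced \
        out/sparseVectors/tf-vectors \
        out/sparseVectors/dictionary.file-0 \
        out/matrix/matrix
    do
        # -e is true for files and directories alike
        [ -e "$p" ] || echo "$p"
    done
}

# Prints nothing when all four inputs are in place.
missing_outputs
```

Running it before cvb0_local lists the paths that still have to be produced, in pipeline order, so the first line printed points at the step to rerun.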
    

    Inspect the output by showing the top 10 words of each topic:

    $MAHOUT_HOME/mahout vectordump \
        -i out/cvb/to_out \
        --dictionary out/sparseVectors/dictionary.file-0 \
        --dictionaryType sequencefile \
        --vectorSize 10 \
        -sort out/cvb/to_out
    
  • 2021-02-04 21:18

    Thanks to JoKnopp for the detailed commands.

    If you get this exception:

    Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String

    you need to pass the "maxIterations" command line option explicitly: --maxIterations (-m) maxIterations

    I used -m 20 and it worked.

    refer to: https://issues.apache.org/jira/browse/MAHOUT-1141
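
Put together, the cvb0_local call from the first answer with the extra option might look like the sketch below. Only `-m 20` is new; every other argument is copied from that answer. The fallback value for `MAHOUT_HOME` is a placeholder assumption, and the script runs Mahout only if the launcher is actually found there.

```shell
# Placeholder default (assumption); point this at your own mahout/bin folder.
MAHOUT_HOME="${MAHOUT_HOME:-mahout/bin}"

# Arguments from the first answer, plus an explicit iteration count.
CVB_ARGS="cvb0_local -i out/matrix/matrix"
CVB_ARGS="$CVB_ARGS -d out/sparseVectors/dictionary.file-0"
CVB_ARGS="$CVB_ARGS -a 0.5 -top 4"
CVB_ARGS="$CVB_ARGS -do out/cvb/do_out -to out/cvb/to_out"
CVB_ARGS="$CVB_ARGS -m 20"

# Show the full command, then run it only if the launcher exists.
echo "$MAHOUT_HOME/mahout $CVB_ARGS"
if [ -x "$MAHOUT_HOME/mahout" ]; then
    # CVB_ARGS is left unquoted on purpose so it splits into separate arguments.
    "$MAHOUT_HOME/mahout" $CVB_ARGS
fi
```

The value 20 is simply the one reported to work above; nothing here suggests it is a tuned setting.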
