Spark LDA woes - prediction and OOM questions
I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA. Starting small, following the Java examples, I built a 100K doc/600K feature/250 topic/100 iteration model using the Distributed model/EM optimizer. The model built fine and the resulting topics were coherent. I then wrote a wrapper around the new single-document prediction routine (SPARK-10809; which I cherry picked into a custom Spark 1.6.0-based distribution) to get topics for new, unseen documents