Preparing data for LDA in Spark
I'm working on implementing a Spark LDA model (via the Scala API), and am having trouble with the necessary formatting steps for my data. My raw data (stored in a text file) is in the following format: essentially a list of tokens and the documents they correspond to. A simplified example:

    doc  XXXXX  term  XXXXX
    1    x      'a'   x
    1    x      'a'   x
    1    x      'b'   x
    2    x      'b'   x
    2    x      'd'   x
    ...

Where the XXXXX columns are garbage data I don't care about. I realize this is an atypical way of storing corpus data, but it's what I have. As I hope is clear from the example, there's one line per token in the raw data (so if a term appears twice in a document, it shows up on two lines).
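For concreteness, here's a rough sketch of the kind of transformation I think I need: parse out the (doc, term) pairs, count how many times each term appears in each document, and turn each document into a term-count vector for LDA. The file name `tokens.txt`, the column positions, and the choice of whitespace as the delimiter are all assumptions for the sake of the example:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext(new SparkConf().setAppName("lda-prep").setMaster("local[*]"))

    // Assumed layout: whitespace-delimited, doc id in column 0, term in column 2
    val tokens = sc.textFile("tokens.txt")
      .map(_.split("\\s+"))
      .map(cols => (cols(0).toLong, cols(2)))

    // Assign each distinct term an index (its position in the count vector)
    val vocab = tokens.values.distinct().collect().zipWithIndex.toMap
    val vocabSize = vocab.size

    // One input line per token, so counting (doc, term) pairs gives term frequencies
    val corpus = tokens
      .map { case (docId, term) => ((docId, vocab(term)), 1.0) }
      .reduceByKey(_ + _)
      .map { case ((docId, termIdx), count) => (docId, (termIdx, count)) }
      .groupByKey()
      .mapValues(counts => Vectors.sparse(vocabSize, counts.toSeq))

    // mllib's LDA expects an RDD[(Long, Vector)] of document id -> term-count vector
    val ldaModel = new LDA().setK(10).run(corpus)

Is this roughly the right approach, or is there a more idiomatic way to get from the one-line-per-token format to the RDD[(Long, Vector)] that LDA expects?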