How do I preserve the key or index of the input to Spark's HashingTF() function?
Based on the Spark 1.4 documentation (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html), I'm writing a TF-IDF example that converts text documents to vectors of term-frequency values. The example given shows how this can be done, but the input is an RDD of token lists with no keys, which means my output RDD no longer contains an index or key to refer back to the original document. The example is this:

    documents = sc.textFile("...").map(lambda line: line.split(" "))
    hashingTF = HashingTF()