How do I preserve the key or index of the input to Spark's HashingTF() function?

Submitted by 青春壹個敷衍的年華 on 2019-12-10 16:57:50

Question


Based on the Spark documentation for 1.4 (https://spark.apache.org/docs/1.4.0/mllib-feature-extraction.html), I'm writing a TF-IDF example that converts text documents to vectors of values. The example given shows how this can be done, but its input is an RDD of tokens with no keys, so my output RDD no longer contains an index or key that refers back to the original document. The example is this:

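from pyspark.mllib.feature import HashingTF

# Each line becomes a list of tokens; no key is attached.
documents = sc.textFile("...").map(lambda line: line.split(" "))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)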

I would like to do something like this:

documents = sc.textFile("...").map(lambda line: (UNIQUE_LINE_KEY, line.split(" ")))

hashingTF = HashingTF()
tf = hashingTF.transform(documents)

and have the resulting tf variable contain the UNIQUE_LINE_KEY value somewhere. Am I just missing something obvious? From the examples it appears there is no good way to link the document RDD with the tf RDD.


Answer 1:


If you use a version of Spark built after commit 85b96372cf0fd055f89fc639f45c1f2cb02a378f (this includes 1.4) and use the ml API's HashingTF (which requires DataFrame input instead of plain RDDs), its output keeps the original columns, so a key column carries through alongside the new feature column. Hope that helps!




Answer 2:


I ran into the same issue. The example in the docs encourages you to apply the transformations directly to the whole RDD.

However, you can also apply the transformations to the individual documents and vectors, which lets you keep the keys in whichever way you choose.

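import org.apache.spark.mllib.feature.{HashingTF, IDF}

val input = sc.textFile("...")
// Key each document by its original line text.
val documents = input.map(doc => doc -> doc.split(" ").toSeq)

val hashingTF = new HashingTF()
// mapValues transforms each token sequence and leaves the key untouched.
val tf = documents.mapValues(hashingTF.transform(_))
tf.cache()
// IDF is fit on the vectors alone, then applied per value to keep the keys.
val idf = new IDF().fit(tf.values)
val tfidf = tf.mapValues(idf.transform(_))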

Note that this code yields an RDD[(String, Vector)] instead of an RDD[Vector].
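
The reason the key survives is that the term-frequency step looks at one document at a time, so the key can simply ride along with each document. This can be illustrated without Spark. The following is a minimal pure-Python sketch of the hashing trick, not Spark's implementation: the names hashing_tf and keyed_tf, the use of zlib.crc32 as the hash, the bucket count of 16, and the choice of the line position as the key are all assumptions made for illustration. Spark's HashingTF uses its own hash function and a much larger default number of features.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Map a list of tokens to a sparse {index: count} term-frequency dict.

    Illustrative only: zlib.crc32 stands in for Spark's hash function.
    """
    counts = Counter()
    for token in tokens:
        # Hash the token into one of num_features buckets.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)


def keyed_tf(lines, num_features=16):
    """Pair each line's key (assumed here: its position) with its term frequencies."""
    return [
        (key, hashing_tf(line.split(" "), num_features))
        for key, line in enumerate(lines)
    ]


lines = ["a b a", "c d"]
result = keyed_tf(lines)
```

Here result is a list of (key, term_frequencies) pairs, so every vector can still be traced back to the line it came from, which is the same shape of result the mapValues approach gives on an RDD.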



Source: https://stackoverflow.com/questions/31151163/how-do-i-preserve-the-key-or-index-of-input-to-spark-hashingtf-function
