Question
I have used the Spark API (https://spark.apache.org/docs/latest/ml-features.html#tf-idf) to calculate TF-IDF on a DataFrame. What I cannot work out is how to do this on grouped data: group the DataFrame with groupBy, calculate TF-IDF separately for each group, and get a single DataFrame as the result.
For example, given this input:
id | category | texts
0 | smallLetters | Array("a", "b", "c")
1 | smallLetters | Array("a", "b", "b", "c", "a")
2 | capitalLetters | Array("A", "B", "C")
3 | capitalLetters | Array("A", "B", "B", "c", "A")
Expected output when grouping by the column "category":
id | category | texts | vector
0 | smallLetters | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0])
1 | smallLetters | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
2 | capitalLetters | Array("A", "B", "C") | (3,[3,4,5],[1.0,1.0,1.0])
3 | capitalLetters | Array("A", "B", "B", "c", "A") | (5, [3,4,2],[2.0,2.0,1.0])
Following the example from the Spark website, I am currently doing something similar to this:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)
The issue I am facing is how to calculate TF-IDF with the above code after a groupBy operation on category.
Edit: I want to define each group as its own corpus. That is, smallLetters is one corpus and capitalLetters is another, so that for the TF-IDF calculation the smallLetters corpus contains 2 documents and the capitalLetters corpus contains 2 documents.
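To make the "one corpus per category" requirement concrete, here is a minimal sketch of one possible approach, not a definitive solution: filter the DataFrame once per category, fit CountVectorizer and IDF on that subset only, and union the per-group results back into a single DataFrame. The helper name tfIdfForGroup, the intermediate column rawCounts, and the local SparkSession settings are assumptions made for illustration. Each group gets its own vocabulary here, so vector indices are local to the group rather than global as in the sample output above.

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: a local session for illustration; in spark-shell this returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("GroupedTfIdf")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (0, "smallLetters", Array("a", "b", "c")),
  (1, "smallLetters", Array("a", "b", "b", "c", "a")),
  (2, "capitalLetters", Array("A", "B", "C")),
  (3, "capitalLetters", Array("A", "B", "B", "c", "A"))
).toDF("id", "category", "texts")

// Hypothetical helper: treats the given rows as one corpus and adds a TF-IDF "vector" column.
def tfIdfForGroup(group: DataFrame): DataFrame = {
  // Term counts, with a vocabulary learned from this group only.
  val counted = new CountVectorizer()
    .setInputCol("texts")
    .setOutputCol("rawCounts")
    .fit(group)
    .transform(group)

  // IDF weights, with document frequencies computed from this group only.
  new IDF()
    .setInputCol("rawCounts")
    .setOutputCol("vector")
    .fit(counted)
    .transform(counted)
    .drop("rawCounts")
}

// Collect the distinct categories to the driver and process each one separately.
val categories: Array[String] = df.select("category").distinct().as[String].collect()

val result: DataFrame = categories
  .map(cat => tfIdfForGroup(df.filter($"category" === cat)))
  .reduce(_ union _)

result.orderBy("id").show(false)
```

This runs one pair of fit calls per category on the driver, so it is only reasonable for a small number of groups. Note also that Spark's IDF uses log((m + 1) / (df(t) + 1)), where m is the number of documents in the corpus, so a term that occurs in every document of its group gets weight 0. With the sample data that should make the smallLetters vectors all zeros, because "a", "b" and "c" appear in both documents.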
Source: https://stackoverflow.com/questions/45676390/how-to-calculate-tf-idf-on-grouped-spark-dataframe-in-scala