How to calculate TF-IDF on grouped spark dataframe in scala?

Submitted by 拈花ヽ惹草 on 2020-01-24 17:32:09

Question


I have used the Spark ML API (https://spark.apache.org/docs/latest/ml-features.html#tf-idf) to calculate TF-IDF on a DataFrame. What I cannot work out is how to do it on grouped data: group the DataFrame with groupBy, compute TF-IDF separately for each group, and get a single DataFrame back as the result.

For example, for the input:

id |  category        | texts                           
 0 |  smallLetters    | Array("a", "b", "c")            
 1 |  smallLetters    | Array("a", "b", "b", "c", "a")  
 2 |  capitalLetters  | Array("A", "B", "C")
 3 |  capitalLetters  | Array("A", "B", "B", "c", "A")

Sample output when grouping by the "category" column:

id | category       | texts                           | vector
0  | smallLetters   | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
1  | smallLetters   | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])
2  | capitalLetters | Array("A", "B", "C")            | (3,[3,4,5],[1.0,1.0,1.0])
3  | capitalLetters | Array("A", "B", "B", "c", "A")  | (5, [3,4,2],[2.0,2.0,1.0])

Taking the example from the Spark website, my code currently looks like this:

import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")

// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .setVocabSize(3)
  .setMinDF(2)
  .fit(df)

// alternatively, define CountVectorizerModel with a-priori vocabulary
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features")

cvModel.transform(df).show(false)
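The snippet above produces only term-frequency vectors; in the same Spark ML guide, TF-IDF is completed by chaining an IDF stage on the CountVectorizer output. A minimal sketch continuing from the code above (the `tfidf` output column name is my own choice):

```scala
import org.apache.spark.ml.feature.IDF

// continue from cvModel and df above: rescale the raw counts by IDF
val tfDf = cvModel.transform(df)   // "features" holds the TF vectors

val idfModel = new IDF()
  .setInputCol("features")
  .setOutputCol("tfidf")           // assumed column name, not from the docs
  .fit(tfDf)

idfModel.transform(tfDf).show(false)
```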

Now the issue I am facing is how to calculate TF-IDF with the above code after doing a groupBy on the category column.

Edit: I want to define the corpus as the grouped data. That is, smallLetters is one corpus and capitalLetters is another, so that for the TF-IDF calculation the smallLetters corpus contains 2 documents and the capitalLetters corpus contains 2 documents.
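One way to realize per-group corpora (a sketch, not from the original post) is to fit a separate CountVectorizer and IDF model for each category value and union the per-group results. The function name `tfidfPerGroup` and the column names `tf`/`tfidf` are my own assumptions:

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}
import org.apache.spark.sql.DataFrame

// Treat each category as its own corpus: filter the rows for one category,
// fit CountVectorizer + IDF on just those rows, then union all the results.
def tfidfPerGroup(df: DataFrame): DataFrame = {
  val categories = df.select("category").distinct.collect.map(_.getString(0))
  val perGroup = categories.map { cat =>
    val sub = df.filter(df("category") === cat)
    val cvModel = new CountVectorizer()
      .setInputCol("texts")
      .setOutputCol("tf")
      .fit(sub)                       // vocabulary built from this group only
    val tfDf = cvModel.transform(sub)
    val idfModel = new IDF()
      .setInputCol("tf")
      .setOutputCol("tfidf")
      .fit(tfDf)                      // document frequencies from this group only
    idfModel.transform(tfDf)
  }
  perGroup.reduce(_ union _)
}
```

One caveat: each group gets its own vocabulary, so vector indices are only comparable within a group. If a shared index space across groups is wanted (as in the sample output above), fit the CountVectorizer once on the whole DataFrame and fit only the IDF per group.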

Source: https://stackoverflow.com/questions/45676390/how-to-calculate-tf-idf-on-grouped-spark-dataframe-in-scala
