How to get word details from TF Vector RDD in Spark ML Lib?

后端 未结 1 1571
自闭症患者
自闭症患者 2020-11-28 12:05

I have created Term Frequency using HashingTF in Spark. I have got the term frequencies using tf.transform for each word.

But the results a

相关标签:
1条回答
  • 2020-11-28 12:29

    Well, you can't. Since hashing is non-injective there is no inverse function. In other words infinite number of tokens can map to a single bucket so it is impossible to tell which one is actually there.

    If you're using a large hash and number of unique tokens is relatively low then you can try to create a lookup table from bucket to possible tokens from your dataset. It is one-to-many mapping but if above conditions are met number of conflicts should be relatively low.

    If you need a reversible transformation you can use combine Tokenizer and StringIndexer and build a sparse feature vector manually.

    See also: What hashing function does Spark use for HashingTF and how do I duplicate it?

    Edit:

    In Spark 1.5+ (PySpark 1.6+) you can use CountVectorizer which applies reversible transformation and stores vocabulary.

    Python:

    from pyspark.ml.feature import CountVectorizer
    
    df = sc.parallelize([
        (1, ["foo", "bar"]), (2, ["foo", "foobar", "baz"])
    ]).toDF(["id", "tokens"])
    
    vectorizer = CountVectorizer(inputCol="tokens", outputCol="features").fit(df)
    vectorizer.vocabulary
    ## ('foo', 'baz', 'bar', 'foobar')
    

    Scala:

    import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
    
    val df = sc.parallelize(Seq(
        (1, Seq("foo", "bar")), (2, Seq("foo", "foobar", "baz"))
    )).toDF("id", "tokens")
    
    val model: CountVectorizerModel = new CountVectorizer()
      .setInputCol("tokens")
      .setOutputCol("features")
      .fit(df)
    
    model.vocabulary
    // Array[String] = Array(foo, baz, bar, foobar)
    

    where element at the 0th position corresponds to index 0, element at the 1st position to index 1 and so on.

    0 讨论(0)
提交回复
热议问题