Question
I want to translate the following routine from the class [Word2VecModel](https://github.com/apache/spark/blob/branch-2.3/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala) into PySpark:
```scala
override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  val vectors = wordVectors.getVectors
    .mapValues(vv => Vectors.dense(vv.map(_.toDouble)))
    .map(identity) // mapValues doesn't return a serializable map (SI-7005)
  val bVectors = dataset.sparkSession.sparkContext.broadcast(vectors)
  val d = $(vectorSize)
  val word2Vec = udf { sentence: Seq[String] =>
    if (sentence.isEmpty) {
      Vectors.sparse(d, Array.empty[Int], Array.empty[Double])
    } else {
      val sum = Vectors.zeros(d)
      sentence.foreach { word =>
        bVectors.value.get(word).foreach { v =>
          BLAS.axpy(1.0, v, sum)
        }
      }
      BLAS.scal(1.0 / sentence.size, sum)
      sum
    }
  }
  dataset.withColumn($(outputCol), word2Vec(col($(inputCol))))
}
```
Can someone help me convert this (it averages the word vectors of all the words in each sentence) into equivalent PySpark code? I have translated some portions in bits and pieces, but I am not able to put the whole thing together.
For example, I found that the inner implementation of BLAS.axpy(), which I could leverage for PySpark, is

```
axpy(double a, Vector x, Vector y)
y += a * x
```

In the same way, the inner logic of BLAS.scal() is

```
scal(double a, Vector x)
x = a * x
```
For the Scala identity function I created the same function in PySpark, since PySpark doesn't have one:
```python
def identity(x):
    return x
```
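That said, the `.map(identity)` call in the Scala code seems to exist only to work around a Scala serialization quirk (the SI-7005 issue mentioned in the comment), so I suspect no equivalent pass is needed in PySpark at all; if one were, Python already has a one-liner:

```python
identity = lambda x: x  # equivalent helper; likely unnecessary, SI-7005 is Scala-specific
```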
I tried to convert the following lines:
```scala
val vectors = wordVectors.getVectors
  .mapValues(vv => Vectors.dense(vv.map(_.toDouble)))
  .map(identity)
```
And I came up with this, though I am not sure how to express vv.map(_.toDouble) in PySpark. Is this right?

```python
vectors_final = model.getVectors().rdd.mapValues(lambda vv: Vectors.dense(vv)).map(lambda x: identity(x))
```
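Putting the pieces together, the closest full translation I can sketch is below. It is only a sketch under some assumptions: model is a fitted pyspark.ml.feature.Word2VecModel, spark is the active SparkSession, dataset holds the sentences, the vocabulary fits on the driver, and the column names "text" and "features" stand in for $(inputCol) and $(outputCol):

```python
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# getVectors() returns a DataFrame of (word, vector) rows; the vectors are
# already DenseVectors of doubles, so no equivalent of vv.map(_.toDouble)
# is needed. Collect them into a plain dict and broadcast it, as the Scala
# code does with bVectors.
vectors = {row["word"]: row["vector"].toArray()
           for row in model.getVectors().collect()}
b_vectors = spark.sparkContext.broadcast(vectors)
d = len(next(iter(vectors.values())))  # stands in for $(vectorSize)

@udf(returnType=VectorUDT())
def word2vec(sentence):
    if not sentence:
        return Vectors.sparse(d, [], [])
    total = np.zeros(d)            # Vectors.zeros(d)
    for word in sentence:
        v = b_vectors.value.get(word)
        if v is not None:
            total += v             # BLAS.axpy(1.0, v, sum)
    total /= len(sentence)         # BLAS.scal(1.0 / sentence.size, sum)
    return Vectors.dense(total)

# "text" and "features" stand in for $(inputCol) and $(outputCol)
result = dataset.withColumn("features", word2vec(F.col("text")))
```

Broadcasting the dict mirrors the Scala code: there, wordVectors.getVectors is already a driver-local map, and broadcasting it avoids re-serializing it into every task closure.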
Thank you.
Source: https://stackoverflow.com/questions/61290999/convert-scala-code-to-pyspark-word2vec-scala-tranform-routine