Question
I want to translate the following routine from the class [Word2VecModel](https://github.com/apache/spark/blob/branch-2.3/mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala) into PySpark:
```scala
override def transform(dataset: Dataset[_]): DataFrame = {
  transformSchema(dataset.schema, logging = true)
  val vectors = wordVectors.getVectors
    .mapValues(vv => Vectors.dense(vv.map(_.toDouble)))
    .map(identity) // mapValues doesn't return a serializable map (SI-7005)
  val bVectors = dataset.sparkSession.sparkContext.broadcast(vectors)
  val d = $(vectorSize)
  val word2Vec = udf { sentence: Seq[String] =>
    if (sentence.isEmpty) {
      Vectors.sparse(d, Array.empty[Int], Array.empty[Double])
    } else {
      val sum = Vectors.zeros(d)
      sentence.foreach { word =>
        bVectors.value.get(word).foreach { v =>
          BLAS.axpy(1.0, v, sum)
        }
      }
      BLAS.scal(1.0 / sentence.size, sum)
      sum
    }
  }
  dataset.withColumn($(outputCol), word2Vec(col($(inputCol))))
}
```
Can someone help me convert this (it averages the word vectors of all the words in each sentence) into equivalent PySpark code? I have translated some portions in bits and pieces, but I am not able to put the whole thing together.
For example, I found that the inner implementation of BLAS.axpy(), which I could leverage for PySpark, is

```
axpy(double a, Vector x, Vector y)
y += a * x
```

In the same way, the inner logic of BLAS.scal() is

```
scal(double a, Vector x)
x = a * x
```
For the Scala identity function I created the same function in PySpark, since PySpark doesn't have one:
```python
def identity(x):
    return x
```
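That said, the `.map(identity)` call in the Scala code seems to exist only to work around a Scala serialization quirk (the SI-7005 issue mentioned in the comment), so I suspect no equivalent pass is needed in PySpark at all; if one were, Python already has a one-liner:

```python
identity = lambda x: x  # equivalent helper; likely unnecessary, SI-7005 is Scala-specific
```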
I tried to convert the following lines:
```scala
val vectors = wordVectors.getVectors
  .mapValues(vv => Vectors.dense(vv.map(_.toDouble)))
  .map(identity)
```
And I came up with this, though I am not sure how to express vv.map(_.toDouble) in PySpark. Is this right?

```python
vectors_final = model.getVectors().rdd.mapValues(lambda vv: Vectors.dense(vv)).map(lambda x: identity(x))
```
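Putting the pieces together, the closest full translation I can sketch is below. It is only a sketch under some assumptions: model is a fitted pyspark.ml.feature.Word2VecModel, spark is the active SparkSession, dataset holds the sentences, the vocabulary fits on the driver, and the column names "text" and "features" stand in for $(inputCol) and $(outputCol):

```python
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# getVectors() returns a DataFrame of (word, vector) rows; the vectors are
# already DenseVectors of doubles, so no equivalent of vv.map(_.toDouble)
# is needed. Collect them into a plain dict and broadcast it, as the Scala
# code does with bVectors.
vectors = {row["word"]: row["vector"].toArray()
           for row in model.getVectors().collect()}
b_vectors = spark.sparkContext.broadcast(vectors)
d = len(next(iter(vectors.values())))  # stands in for $(vectorSize)

@udf(returnType=VectorUDT())
def word2vec(sentence):
    if not sentence:
        return Vectors.sparse(d, [], [])
    total = np.zeros(d)            # Vectors.zeros(d)
    for word in sentence:
        v = b_vectors.value.get(word)
        if v is not None:
            total += v             # BLAS.axpy(1.0, v, sum)
    total /= len(sentence)         # BLAS.scal(1.0 / sentence.size, sum)
    return Vectors.dense(total)

# "text" and "features" stand in for $(inputCol) and $(outputCol)
result = dataset.withColumn("features", word2vec(F.col("text")))
```

Broadcasting the dict mirrors the Scala code: there, wordVectors.getVectors is already a driver-local map, and broadcasting it avoids re-serializing it into every task closure.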
Thank you.
Source: https://stackoverflow.com/questions/61290999/convert-scala-code-to-pyspark-word2vec-scala-tranform-routine