I had the following list:
id1, column_index1, value1
id2, column_index2, value2
...
which I transformed into an indexed row matrix.
From what I understand of your question, here is what you'll need to do to fit a StandardScaler on your IndexedRows:
import org.apache.spark.mllib.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.mllib.linalg.distributed.IndexedRow
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
// Your (id, column_index, value) triples
val data: RDD[(Int, Int, Double)] = ???

// Total number of columns in the matrix
object nCol {
  val value: Int = ???
}

// Key each triple by its row id
val data_mapped: RDD[(Int, (Int, Double))] =
  data.map({ case (id, col, score) => (id, (col, score)) })

// Collect all (column_index, value) pairs belonging to the same row
val data_mapped_grouped: RDD[(Int, Iterable[(Int, Double)])] =
  data_mapped.groupByKey

// Build one sparse IndexedRow per row id
val indexed_rows: RDD[IndexedRow] = data_mapped_grouped.map {
  case (id, vals) =>
    IndexedRow(id, Vectors.sparse(nCol.value, vals.toSeq))
}
You can get the vectors out of your IndexedRows with a simple map:

val vectors: RDD[Vector] = indexed_rows.map(_.vector)
Now that you have an RDD[Vector], you can fit your scaler on it:
val scaler: StandardScalerModel = new StandardScaler().fit(vectors)
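Note that StandardScaler's defaults scale to unit standard deviation without centering (withMean = false), which keeps sparse vectors sparse. Once fitted, the model can transform your data; a sketch, where the names scaled_vectors and scaled_rows are just illustrative:

// Scale all vectors at once
val scaled_vectors: RDD[Vector] = scaler.transform(vectors)

// Or scale row by row, keeping the original row indices
val scaled_rows: RDD[IndexedRow] =
  indexed_rows.map(row => IndexedRow(row.index, scaler.transform(row.vector)))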
I hope this helps!