Scale Matrix in Scala/Spark


I had the following list:

id1, column_index1, value1
id2, column_index2, value2
...

which I transformed into an indexed row matrix.
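
A minimal sketch of one way such a list can be loaded into an RDD of (id, column_index, value) triples, assuming it lives in a comma-separated text file (the path data.txt and the SparkContext sc are hypothetical):

    import org.apache.spark.rdd.RDD

    // hypothetical input path; each line looks like "id,column_index,value"
    val triples: RDD[(Int, Int, Double)] = sc.textFile("data.txt").map { line =>
      val Array(id, col, value) = line.split(",").map(_.trim)
      (id.toInt, col.toInt, value.toDouble)
    }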

1 Answer

    According to what I understood from your question, here is what you'll need to do to fit a StandardScaler on your IndexedRows:

    import org.apache.spark.mllib.feature.{StandardScaler, StandardScalerModel}
    import org.apache.spark.mllib.linalg.distributed.IndexedRow
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD
    
    // your (id, column_index, value) triples
    val data: RDD[(Int, Int, Double)] = ???
    
    // total number of columns in the matrix
    object nCol {
      val value: Int = ???
    }
    
    // key each (column_index, value) pair by its row id
    val data_mapped: RDD[(Int, (Int, Double))] =
        data.map { case (id, col, score) => (id, (col, score)) }
    // gather all (column_index, value) pairs of each row together
    val data_mapped_grouped: RDD[(Int, Iterable[(Int, Double)])] =
        data_mapped.groupByKey()
    
    // build one sparse row vector per id
    val indexed_rows: RDD[IndexedRow] = data_mapped_grouped.map {
      case (id, vals) =>
        IndexedRow(id, Vectors.sparse(nCol.value, vals.toSeq))
    }
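
    For instance, with a toy 2 x 3 matrix (the input below is made up purely for illustration, assuming an available SparkContext sc and nCol.value = 3):

    // hypothetical toy input: the three non-zero cells of a 2 x 3 matrix
    val toy: RDD[(Int, Int, Double)] =
        sc.parallelize(Seq((0, 0, 1.0), (0, 2, 3.0), (1, 1, 2.0)))
    // running it through the pipeline above yields rows like
    //   IndexedRow(0,(3,[0,2],[1.0,3.0]))
    //   IndexedRow(1,(3,[1],[2.0]))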
    

    You can get the vectors back out of your IndexedRows with a simple map:

    // extract just the vectors, dropping the row indices
    val vectors: RDD[Vector] = indexed_rows.map(_.vector)
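
    Since fit makes a full pass over the data, it can be worth caching the vectors first if you also plan to transform them later (optional, a judgment call):

    // optional: keep the vectors in memory so fitting and any later
    // transform don't both recompute the whole lineage
    vectors.cache()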
    

    Now that you have an RDD[Vector], you can fit your scaler:

    // defaults: withMean = false, withStd = true, so sparse vectors stay sparse
    val scaler: StandardScalerModel = new StandardScaler().fit(vectors)
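
    Note that fit only computes the column statistics; to actually scale the matrix you still have to apply the model. A minimal sketch of transforming each row and rebuilding an indexed row matrix (the names scaled_rows and scaled_matrix are mine, not from your code):

    import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix

    // apply the fitted model to every row, keeping the original row indices
    val scaled_rows: RDD[IndexedRow] = indexed_rows.map { row =>
        IndexedRow(row.index, scaler.transform(row.vector))
    }
    val scaled_matrix = new IndexedRowMatrix(scaled_rows)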
    

    I hope this helps!
