Spark - correlation matrix from file of ratings

广开言路 2021-01-23 10:55

I'm pretty new to Scala and Spark and I'm not able to create a correlation matrix from a file of ratings. It's similar to this question but I have sparse data in the matrix f

1 answer
  • 2021-01-23 11:26

    I believe this code should accomplish what you want:

    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.mllib.linalg._
    ...
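    val corTest = input.map { line =>
      // Drop the user id; limit -1 keeps trailing empty fields so every row has the same length
      val split = line.split(",", -1).drop(1)
      // Treat blank ratings as 0.0
      split.map(elem => if (elem.trim.isEmpty) 0.0 else elem.trim.toDouble)
    }.map(arr => Vectors.dense(arr))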
    
    val corrMatrix = Statistics.corr(corTest)
    

    Here, each input line is split on commas, the user id (the first field) is dropped, blank fields are replaced with 0.0, and a dense vector is built from the resulting array. Statistics.corr then computes the correlation between the columns of those vectors. Also, note that Pearson's method is used by default if no method is supplied.
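    To see what the parsing step produces for a single line, here is a minimal sketch in plain Scala, without Spark. The object name ParseRatingLine and the helper parse are hypothetical, introduced only for illustration. One assumption differs from a bare split(","): the limit -1 is passed so that a trailing empty field (a line ending in a comma) is kept rather than silently dropped.

```scala
object ParseRatingLine {
  // Hypothetical helper mirroring the map step, runnable without Spark.
  def parse(line: String): Array[Double] =
    line.split(",", -1)  // assumption: limit -1 keeps trailing empty fields
      .drop(1)           // drop the user id (first field)
      .map(f => if (f.trim.isEmpty) 0.0 else f.trim.toDouble) // blank rating becomes 0.0

  def main(args: Array[String]): Unit = {
    // prints [0.0, 0.0, 3.0, 0.0, 4.5]
    println(parse("123, , , 3, , 4.5").mkString("[", ", ", "]"))
  }
}
```

    Each parsed array has one entry per rating column, which is what Vectors.dense expects for every row.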

    When run in shell with some examples, I see the following:

    scala> val input = sc.parallelize(Array("123, , , 3, , 4.5", "456, 1, 2, 3, , 4", "789, 4, 2.5, , 0.5, 4", "000, 5, 3.5, , 4.5, "))
    input: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:16
    
    scala> val corTest = ...
    corTest: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MappedRDD[20] at map at <console>:18
    
    scala> val corrMatrix = Statistics.corr(corTest)
    ...
    corrMatrix: org.apache.spark.mllib.linalg.Matrix =
    1.0                  0.9037378388935388   -0.9701425001453317  ... (5 total)
    0.9037378388935388   1.0                  -0.7844645405527361  ...
    -0.9701425001453317  -0.7844645405527361  1.0                  ...
    0.7709910794438823   0.7273340668525836   -0.6622661785325219  ...
    -0.7513578452729373  -0.7560667258329613  0.6195855517393626   ...
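    As a sanity check on the matrix above, the entries can be reproduced by hand. The following is a rough sketch in plain Scala (no Spark) that computes Pearson's correlation between two columns of the parsed example data, with blanks already replaced by 0.0. The names PearsonCheck, column and pearson are hypothetical and not part of MLlib.

```scala
object PearsonCheck {
  // The four example rows after parsing (user id dropped, blanks set to 0.0).
  val rows: Array[Array[Double]] = Array(
    Array(0.0, 0.0, 3.0, 0.0, 4.5),
    Array(1.0, 2.0, 3.0, 0.0, 4.0),
    Array(4.0, 2.5, 0.0, 0.5, 4.0),
    Array(5.0, 3.5, 0.0, 4.5, 0.0)
  )

  // Extract column j across all rows.
  def column(j: Int): Array[Double] = rows.map(_(j))

  // Pearson correlation: covariance divided by the product of the standard deviations.
  def pearson(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.nonEmpty, "columns must be non-empty and equally long")
    val mx = x.sum / x.length
    val my = y.sum / y.length
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  def main(args: Array[String]): Unit = {
    println(pearson(column(0), column(1))) // approximately 0.9037, the (0, 1) entry
    println(pearson(column(0), column(2))) // approximately -0.9701, the (0, 2) entry
  }
}
```

    The values agree with the first row of corrMatrix, which confirms that the correlation is taken between rating columns and that the zero-filled blanks take part in the computation like any other value.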
    