Spark Scala - How to group dataframe rows and apply complex function to the groups?

Asked by 盖世英雄少女心 · 2020-12-29 00:07

I am trying to solve what should be a simple problem and I am already sick of it; I hope somebody can help me out with this. I have a dataframe shaped like this:

(example dataframe omitted in the original post)
1 Answer
  • Answered 2020-12-29 00:40

    Cosine similarity is not a complex function and can be expressed using standard SQL aggregations. Let's consider the following example:

    import spark.implicits._  // for .toDF and the $ column syntax (spark is the SparkSession)

    val df = Seq(
      ("feat1", 1.0, "item1"),
      ("feat2", 1.0, "item1"),
      ("feat6", 1.0, "item1"),
      ("feat1", 1.0, "item2"),
      ("feat3", 1.0, "item2"),
      ("feat4", 1.0, "item3"),
      ("feat5", 1.0, "item3"),
      ("feat1", 1.0, "item4"),
      ("feat6", 1.0, "item4")
    ).toDF("feature", "value", "item")
    

    where feature is a feature identifier, value is the corresponding value, item is an object identifier, and each (feature, item) pair has at most one corresponding value.
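    To make that layout concrete, the same triples can be grouped into one sparse vector per item with plain Scala (no Spark needed) — a sketch just for illustration, with hypothetical names:

```scala
val triples = Seq(
  ("feat1", 1.0, "item1"), ("feat2", 1.0, "item1"), ("feat6", 1.0, "item1"),
  ("feat1", 1.0, "item2"), ("feat3", 1.0, "item2"),
  ("feat4", 1.0, "item3"), ("feat5", 1.0, "item3"),
  ("feat1", 1.0, "item4"), ("feat6", 1.0, "item4")
)

// One sparse vector per item: a map from feature id to value.
val vectors: Map[String, Map[String, Double]] =
  triples.groupBy(_._3).map { case (item, rows) =>
    item -> rows.map { case (feature, value, _) => feature -> value }.toMap
  }
```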

    Cosine similarity is defined as:

        cosine(A, B) = (A · B) / (||A|| * ||B||)

    where the numerator (the dot product) can be computed as:

    import org.apache.spark.sql.functions.sum

    val numer = df.as("this").withColumnRenamed("item", "this")
      .join(df.as("other").withColumnRenamed("item", "other"), Seq("feature"))
      .where($"this" < $"other")
      .groupBy($"this", $"other")
      .agg(sum($"this.value" * $"other.value").alias("dot"))
    

    and norms used in the denominator as:

    import org.apache.spark.sql.functions.sqrt
    
    val norms = df.groupBy($"item").agg(sqrt(sum($"value" * $"value")).alias("norm"))
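    For the example data these norms are easy to check by hand: item1 has three unit features and every other item has two, so the norms are √3 and √2. A plain-Scala sketch of the same aggregation (names are illustrative):

```scala
// Per-item feature values from the example dataframe above.
val values = Map(
  "item1" -> Seq(1.0, 1.0, 1.0),
  "item2" -> Seq(1.0, 1.0),
  "item3" -> Seq(1.0, 1.0),
  "item4" -> Seq(1.0, 1.0)
)

// Same as the sqrt(sum(value * value)) aggregation, done locally.
val localNorms: Map[String, Double] = values.map { case (item, vs) =>
  item -> math.sqrt(vs.map(v => v * v).sum)
}
```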
    

    Combined together:

    val cosine = ($"dot" / ($"this_norm.norm" * $"other_norm.norm")).as("cosine") 
    
    val similarities = numer
     .join(norms.alias("this_norm").withColumnRenamed("item", "this"), Seq("this"))
     .join(norms.alias("other_norm").withColumnRenamed("item", "other"), Seq("other"))
     .select($"this", $"other", cosine)
    

    with the result representing the non-zero entries of the upper triangular similarity matrix, ignoring the diagonal (which is trivially 1):

    +-----+-----+-------------------+
    | this|other|             cosine|
    +-----+-----+-------------------+
    |item1|item4| 0.8164965809277259|
    |item1|item2|0.40824829046386296|
    |item2|item4| 0.4999999999999999|
    +-----+-----+-------------------+
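    These numbers can be sanity-checked without Spark: only features present in both items contribute to the dot product, so for example item1 and item4 share feat1 and feat6 (dot = 2), and with norms √3 and √2 the similarity is 2/√6 ≈ 0.816. A plain-Scala sketch (the function and value names are hypothetical):

```scala
// Cosine similarity of two sparse vectors given as feature -> value maps.
def cosine(a: Map[String, Double], b: Map[String, Double]): Double = {
  def norm(v: Map[String, Double]) = math.sqrt(v.values.map(x => x * x).sum)
  // Only shared features contribute to the dot product.
  val dot = a.keySet.intersect(b.keySet).iterator.map(k => a(k) * b(k)).sum
  dot / (norm(a) * norm(b))
}

val item1 = Map("feat1" -> 1.0, "feat2" -> 1.0, "feat6" -> 1.0)
val item2 = Map("feat1" -> 1.0, "feat3" -> 1.0)
val item4 = Map("feat1" -> 1.0, "feat6" -> 1.0)
```

    item3 shares no features with the others, which is why it does not appear in the result at all.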
    

    This should be equivalent to:

    import org.apache.spark.sql.functions.{array, col}
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
    import org.apache.spark.mllib.linalg.Vectors
    
    val pivoted = df.groupBy("item").pivot("feature").sum()
      .na.fill(0.0)
      .orderBy("item")
    
    val mat = new IndexedRowMatrix(pivoted
      .select(array(pivoted.columns.tail.map(col): _*))
      .rdd
      .zipWithIndex
      .map {
        case (row, idx) => 
          new IndexedRow(idx, Vectors.dense(row.getSeq[Double](0).toArray))
      })
    
    mat.toCoordinateMatrix.transpose
      .toIndexedRowMatrix.columnSimilarities
      .toBlockMatrix.toLocalMatrix
    
    0.0  0.408248290463863  0.0  0.816496580927726
    0.0  0.0                0.0  0.4999999999999999
    0.0  0.0                0.0  0.0
    0.0  0.0                0.0  0.0
    

    Regarding your code:

    • Execution is sequential because your code operates on a local (collected) collection.
    • myComplexFunction cannot be distributed any further because it operates on distributed data structures and contexts, which cannot be nested inside other distributed operations.