How to compare multiple rows?

灰色年华 2021-01-05 11:19

I'd like to compare two consecutive rows, row i and row i-1, of col2 (sorted by col1).

If item_i of the

1 Answer
  • 2021-01-05 11:55

    You can use a combination of a window function and an aggregate to do this. The window function looks up the next value of col2 (using col1 for ordering), and the aggregate counts how many times that next value differs from the current one. This is implemented in the code below:

    // In spark-shell, spark.implicits._ provides toDF and the $ column syntax.
    import spark.implicits._
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{coalesce, lead, sum}

    val data = Seq(
      ("row_1", "item_1"),
      ("row_2", "item_1"),
      ("row_3", "item_2"),
      ("row_4", "item_1"),
      ("row_5", "item_2"),
      ("row_6", "item_1")).toDF("col1", "col2")

    val q = data.
      // For each row, fetch col2 of the next row (ordered by col1);
      // the last row has no successor, so fall back to its own col2.
      withColumn("col2_next",
        coalesce(lead($"col2", 1) over Window.orderBy($"col1"), $"col2")).
      groupBy($"col2").
      // Count the rows where the value changes between this row and the next.
      agg(sum(($"col2" =!= $"col2_next") cast "int") as "col3")
    
    scala> q.show
    17/08/22 10:15:53 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
    +------+----+
    |  col2|col3|
    +------+----+
    |item_1|   2|
    |item_2|   2|
    +------+----+
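    The same consecutive-row comparison can be sanity-checked without Spark, using plain Scala collections on the sample data above. This is only a sketch of the counting logic, not a replacement for the DataFrame query: `sliding(2)` walks the already-sorted col2 values pairwise, which mirrors what `lead` does per row.

    ```scala
    // Same sample data as above, as (col1, col2) pairs already sorted by col1.
    val data = Seq(
      ("row_1", "item_1"),
      ("row_2", "item_1"),
      ("row_3", "item_2"),
      ("row_4", "item_1"),
      ("row_5", "item_2"),
      ("row_6", "item_1"))

    // For each consecutive pair, record a change against the earlier row's item
    // whenever the next row holds a different item, then count changes per item.
    val changes = data.map(_._2).sliding(2).collect {
      case Seq(cur, next) if cur != next => cur
    }.toSeq
    val counts = changes.groupBy(identity).map { case (k, v) => k -> v.size }
    // counts: Map(item_1 -> 2, item_2 -> 2), matching the DataFrame result above.
    ```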
    