How to update column based on a condition (a value in a group)?

猫巷女王i 2021-02-07 18:16

I have the following df:

+---+----+-----+
|sno|dept|color|
+---+----+-----+
|  1|  fn|  red|
|  2|  fn| blue|
|  3|  fn|green|
+---+----+-----+
5 Answers
  •  梦谈多话
    2021-02-07 18:28

    Given:

    val df = Seq(
      (1, "fn", "red"),
      (2, "fn", "blue"),
      (3, "fn", "green"),
      (4, "aa", "blue"),
      (5, "aa", "green"),
      (6, "bb", "red"),
      (7, "bb", "red"),
      (8, "aa", "blue")
    ).toDF("id", "fn", "color")
    

    Do the calculation:

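    import org.apache.spark.sql.functions._ // already in scope in spark-shell

    // Per group: collect the distinct colors and flag groups that contain "red".
    val redOrNot = df.groupBy("fn")
      .agg(collect_set('color) as "values")
      .withColumn("hasRed", array_contains('values, "red"))

    // when without otherwise evaluates to null where hasRed is false
    val colorPicker = when('hasRed, "red")
    val result = df.join(redOrNot, "fn")
      .withColumn("resultColor", colorPicker)
      .withColumn("color", coalesce('resultColor, 'color)) // coalesce keeps the original color where resultColor is null
      .select('id, 'fn, 'color)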
    

    The result looks as follows, which seems to be what was asked for:

    scala> result.show
    +---+---+-----+
    | id| fn|color|
    +---+---+-----+
    |  1| fn|  red|
    |  2| fn|  red|
    |  3| fn|  red|
    |  4| aa| blue|
    |  5| aa|green|
    |  6| bb|  red|
    |  7| bb|  red|
    |  8| aa| blue|
    +---+---+-----+
    

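    You can also chain when operators and supply a default value with otherwise, which would make the coalesce step unnecessary. Consult the scaladoc of the when function (org.apache.spark.sql.functions.when) for details.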
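
    A minimal sketch of the when/otherwise variant, assuming the same sample data as above. The SparkSession setup (local master, app name) is an assumption added only to make the snippet self-contained; in spark-shell a session named spark already exists.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{array_contains, col, collect_set, when}

    // Assumed local session, only so the snippet runs on its own.
    val spark = SparkSession.builder().master("local[*]").appName("when-otherwise").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "fn", "red"), (2, "fn", "blue"), (3, "fn", "green"),
      (4, "aa", "blue"), (5, "aa", "green"),
      (6, "bb", "red"), (7, "bb", "red"), (8, "aa", "blue")
    ).toDF("id", "fn", "color")

    // Flag the groups that contain at least one "red" row.
    val redOrNot = df.groupBy("fn")
      .agg(collect_set(col("color")).as("values"))
      .withColumn("hasRed", array_contains(col("values"), "red"))

    // otherwise supplies the default, so no null and no coalesce is needed.
    // Further conditions could be chained as when(...).when(...).otherwise(...).
    val colorPicker = when(col("hasRed"), "red").otherwise(col("color"))

    val result = df.join(redOrNot, "fn")
      .withColumn("color", colorPicker)
      .select("id", "fn", "color")

    result.orderBy("id").show()
    ```

    The join and the hasRed flag are unchanged from the version above; only the null-then-coalesce step is folded into a single expression.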

    I think you could do something very similar (and perhaps more efficiently) using window functions or user-defined aggregate functions (UDAFs), but... well... I don't currently know how to do it. Leaving the comment here to inspire others ;-)
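
    For what it's worth, the window-function idea might be sketched roughly like this. It is an illustration under assumptions, not a benchmarked alternative: the SparkSession setup and the helper column names isRed and hasRed are made up for the example, and whether it is actually faster than the join depends on the data.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lit, max, when}

    // Assumed local session, only so the snippet runs on its own.
    val spark = SparkSession.builder().master("local[*]").appName("red-per-group").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "fn", "red"), (2, "fn", "blue"), (3, "fn", "green"),
      (4, "aa", "blue"), (5, "aa", "green"),
      (6, "bb", "red"), (7, "bb", "red"), (8, "aa", "blue")
    ).toDF("id", "fn", "color")

    // One window partition per group; no ordering is needed for max.
    val byGroup = Window.partitionBy("fn")

    val windowed = df
      // 1 for red rows, 0 otherwise (hypothetical helper column).
      .withColumn("isRed", when(col("color") === "red", 1).otherwise(0))
      // The max over the partition is 1 exactly when the group has a red row.
      .withColumn("hasRed", max(col("isRed")).over(byGroup) === 1)
      // Overwrite the color for flagged groups, keep it otherwise.
      .withColumn("color", when(col("hasRed"), lit("red")).otherwise(col("color")))
      .select("id", "fn", "color")

    windowed.orderBy("id").show()
    ```

    This avoids the separate aggregation and the join, since the per-group flag is computed in place on every row.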

    p.s. Learnt a lot! Thanks for the idea!
