Merge multiple records in a DataFrame based on a key in Scala Spark


I have a DataFrame whose records are identified by a key, but the same key can appear in more than one record. My goal is to merge all the records that share a key into a single record.

1 Answer
  • 2021-01-26 11:03

    If you know there is at most one non-null value per column in each group (or you don't care which one you get), you can use first with ignoreNulls set to true:

    import org.apache.spark.sql.functions.{first, last}
    import spark.implicits._  // for toDF; assumes an active SparkSession named spark (as in spark-shell)
    
    val df = Seq(
      ("a", Some(1), None, None), ("a", None, Some(2), None),
      ("a", None, None, Some(3))
    ).toDF("key", "value1", "value2", "value3")
    
    df.groupBy("key").agg(
      first("value1", true) as "value1", 
      first("value2", true) as "value2", 
      first("value3", true) as "value3"
    ).show  
    
    // +---+------+------+------+
    // |key|value1|value2|value3|
    // +---+------+------+------+
    // |  a|     1|     2|     3|
    // +---+------+------+------+
    

    or last:

    df.groupBy("key").agg(
      last("value1", true) as "value1", 
      last("value2", true) as "value2", 
      last("value3", true) as "value3"
    ).show  
    
    
    // +---+------+------+------+
    // |key|value1|value2|value3|
    // +---+------+------+------+
    // |  a|     1|     2|     3|
    // +---+------+------+------+    
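
    If there are many value columns, the same aggregation can be built programmatically instead of listing each column by hand. A minimal sketch, reusing the df and imports defined above and assuming the only grouping column is key:

    // Build a first(_, ignoreNulls = true) expression for every non-key column.
    val valueCols = df.columns.filterNot(_ == "key")
    val aggExprs  = valueCols.map(c => first(c, true) as c)
    
    df.groupBy("key")
      .agg(aggExprs.head, aggExprs.tail: _*)
      .show
    
    // +---+------+------+------+
    // |key|value1|value2|value3|
    // +---+------+------+------+
    // |  a|     1|     2|     3|
    // +---+------+------+------+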
    