Collect rows as a list with groupBy in Apache Spark

深忆病人 · 2020-12-30 08:16

I have a particular use case with multiple rows for the same customer, where each row object looks like:

root
 -c1: BigInt
 -c2: String
 -c3: Double
 -c4:         
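
For illustration, rows of this shape could be created as in the sketch below; the c4 and c5 columns are taken from the answer further down, and all values are invented:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: c1 is the customer id; c4, c5 and all
    // values here are assumptions for illustration only
    val df = Seq(
      (1L, "a", 1.0, "x", Map("k" -> 1)),
      (1L, "b", 2.0, "y", Map("k" -> 2)),
      (2L, "c", 3.0, "z", Map("k" -> 3))
    ).toDF("c1", "c2", "c3", "c4", "c5")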
2 Answers
  •  礼貌的吻别
    2020-12-30 08:55

    Instead of an array, you can use the struct function to combine the columns, then apply groupBy with the collect_list aggregation function:

    import org.apache.spark.sql.functions._

    // Pack the columns into one struct per row, then collect those
    // structs into a list for each c1 group
    df.withColumn("combined", struct("c1","c2","c3","c4","c5"))
        .groupBy("c1").agg(collect_list("combined").as("combined_list"))
        .show(false)
    

    so that you get a grouped dataset with the following schema:

    root
     |-- c1: integer (nullable = false)
     |-- combined_list: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- c1: integer (nullable = false)
     |    |    |-- c2: string (nullable = true)
     |    |    |-- c3: string (nullable = true)
     |    |    |-- c4: string (nullable = true)
     |    |    |-- c5: map (nullable = true)
     |    |    |    |-- key: string
     |    |    |    |-- value: integer (valueContainsNull = false)
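
    The struct can also be built inline inside the aggregation, and individual fields can be read back out of the collected structs. A minimal sketch, assuming a DataFrame df with columns c1 through c5 as above:

    import org.apache.spark.sql.functions._

    // Equivalent result without the intermediate withColumn step
    val grouped = df.groupBy("c1")
      .agg(collect_list(struct("c1", "c2", "c3", "c4", "c5")).as("combined_list"))

    // Selecting a field of an array of structs yields an array of that
    // field, e.g. every c2 value per customer
    grouped.select(col("c1"), col("combined_list.c2").as("c2_values")).show(false)

    // explode restores one row per original record
    grouped.select(explode(col("combined_list")).as("r")).select("r.*").show(false)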
    

    I hope the answer is helpful.
