GroupBy and concat array columns in PySpark

挽巷 2021-01-31 20:27

I have this data frame:

df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])

+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+

and I would like to group by store and concatenate the values arrays, so that each store ends up with a single array, something like:

+-----+------------------+
|store|            values|
+-----+------------------+
|    1|[1, 2, 3, 4, 5, 6]|
|    2|            [2, 3]|
+-----+------------------+
5 Answers
  •  执笔经年 · 2021-01-31 20:56

    Since PySpark 2.4, you can do this with collect_list plus the Spark SQL higher-order function aggregate (note: the reduce spelling used in some answers is only an alias of aggregate added in later Spark versions):

        from pyspark.sql.functions import array_distinct, array_sort, collect_list, expr

        # Collect each store's arrays into an array of arrays.
        df = df.groupBy("store").agg(collect_list("values").alias("values"))

        # Concatenate the inner arrays, then deduplicate and sort the result.
        df = df.select("store", array_sort(array_distinct(
            expr("aggregate(values, array(), (x, y) -> concat(x, y))"))).alias("values"))

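    If you prefer to stay in the DataFrame API, flatten (also added in 2.4) performs the same concatenation without an expr string. A minimal end-to-end sketch, assuming a local SparkSession (row order in the output may vary):

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import array_distinct, array_sort, collect_list, flatten

        spark = SparkSession.builder.getOrCreate()

        df = spark.createDataFrame(
            [(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])],
            ["store", "values"],
        )

        # flatten turns the collected array of arrays into one flat array,
        # which is then deduplicated and sorted.
        result = df.groupBy("store").agg(
            array_sort(array_distinct(flatten(collect_list("values")))).alias("values")
        )

        result.show()
        # +-----+------------------+
        # |store|            values|
        # +-----+------------------+
        # |    1|[1, 2, 3, 4, 5, 6]|
        # |    2|            [2, 3]|
        # +-----+------------------+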
