I have this data frame:

df = sc.parallelize([(1, [1, 2, 3]), (1, [4, 5, 6]), (2, [2]), (2, [3])]).toDF(["store", "values"])

+-----+---------+
|store|   values|
+-----+---------+
|    1|[1, 2, 3]|
|    1|[4, 5, 6]|
|    2|      [2]|
|    2|      [3]|
+-----+---------+

and I would like to merge the values arrays of each store into a single array without duplicates.
Since PySpark 2.4, you can use the following code: collect_list gathers the arrays of each store, the higher-order SQL function aggregate concatenates them into a single array, and array_distinct and array_sort then deduplicate and sort the result:
from pyspark.sql.functions import array_distinct, array_sort, collect_list, expr

# Collect all values arrays of each store into one array of arrays.
df = df.groupBy("store").agg(collect_list("values").alias("values"))

# Flatten the nested arrays with the higher-order function aggregate
# (reduce is only an alias for aggregate that arrived in Spark 3.4),
# then drop duplicates and sort.
df = df.select(
    "store",
    array_sort(array_distinct(expr("aggregate(values, array(), (x, y) -> concat(x, y))"))).alias("values"),
)
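For the example data above, df.show() should give something like this (row order after a groupBy is not guaranteed):

+-----+------------------+
|store|            values|
+-----+------------------+
|    1|[1, 2, 3, 4, 5, 6]|
|    2|            [2, 3]|
+-----+------------------+

Spark 2.4 also ships a dedicated flatten function, so the expr("aggregate(...)") step can be replaced by the simpler, equivalent flatten("values").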