I have a particular use case with multiple rows for the same customer, where each row object looks like:
root
 |-- c1: BigInt
 |-- c2: String
 |-- c3: Double
 |-- c4:
Instead of an array, you can use the struct function to combine the columns, then use groupBy with the collect_list aggregation function:
import org.apache.spark.sql.functions._

df.withColumn("combined", struct("c1", "c2", "c3", "c4", "c5"))  // pack all columns into a single struct column
  .groupBy("c1")                                                 // one group per customer
  .agg(collect_list("combined").as("combined_list"))             // gather each group's structs into an array
  .show(false)
so that you have a grouped dataset with the schema:
root
|-- c1: integer (nullable = false)
|-- combined_list: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: integer (nullable = false)
| | |-- c2: string (nullable = true)
| | |-- c3: string (nullable = true)
| | |-- c4: string (nullable = true)
| | |-- c5: map (nullable = true)
| | | |-- key: string
| | | |-- value: integer (valueContainsNull = false)
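For reference, here is a minimal, self-contained sketch that reproduces the idea end to end. The sample data, the column types, and the map column used for c5 are assumptions made purely for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object CollectRowsPerCustomer {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collect-rows-per-customer")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: two rows for customer 1, one row for customer 2
    val df = Seq(
      (1, "a", "x", "p", Map("k1" -> 10)),
      (1, "b", "y", "q", Map("k2" -> 20)),
      (2, "c", "z", "r", Map("k3" -> 30))
    ).toDF("c1", "c2", "c3", "c4", "c5")

    // Pack each row into a struct, then collect all structs per customer
    val grouped = df
      .withColumn("combined", struct("c1", "c2", "c3", "c4", "c5"))
      .groupBy("c1")
      .agg(collect_list("combined").as("combined_list"))

    grouped.printSchema()
    grouped.show(false)

    spark.stop()
  }
}
```

Running printSchema on the grouped result should produce a structure like the one shown above, with one array of structs per distinct c1 value.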
I hope the answer is helpful.