Question
I have a dataset like the one below, with values of col1 repeating multiple times and unique values of col2. The original dataset can have almost a billion rows, so I do not want to use collect or collect_list, as they will not scale out for my use case.
Original Dataset:
+----+----+
|col1|col2|
+----+----+
|  AA|  11|
|  BB|  21|
|  AA|  12|
|  AA|  13|
|  BB|  22|
|  CC|  33|
+----+----+
I want to transform the dataset into the following format, with newColumn as an array of the col2 values.
Transformed Dataset:
+----+------------+
|col1|   newColumn|
+----+------------+
|  AA|[11, 12, 13]|
|  BB|    [21, 22]|
|  CC|        [33]|
+----+------------+
I have seen this solution, but it uses collect_list and will not scale out on big datasets.
Answer 1:
Using Spark's built-in functions is always the best way, and I see no problem in using the collect_list function here. As long as you have sufficient memory, this is the best approach. One way of optimizing your job is to save your data as Parquet, bucket it by col1, and save it as a table. Better still, also partition it by some column that distributes the data evenly.
For example,
from pyspark.sql import functions as F

# Load your data from CSV, Parquet, or any other format
# (sourcepath, database_name, part_col, savepath, tablename are placeholders)
df_stored = spark.read.parquet(sourcepath)
spark.catalog.setCurrentDatabase(database_name)
# Bucket by col1 so the later groupBy on col1 can avoid a shuffle
df_stored.write.mode("overwrite").format("parquet").partitionBy(part_col).bucketBy(10, "col1").option("path", savepath).saveAsTable(tablename)
df_analysis = spark.table(tablename)
df_aggreg = df_analysis.groupby("col1").agg(F.collect_list("col2").alias("newColumn"))
This would speed up the aggregation and avoid a lot of shuffle. Try it out.
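As a quick sanity check (a minimal sketch, assuming the df_aggreg built above), you can print the physical plan and look for the absence of an Exchange step:

# With the table bucketed by col1, the physical plan for this groupBy
# should contain no Exchange (shuffle) step
df_aggreg.explain()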
Answer 2:
- Load your dataframe
- Group by col1
- Aggregate col2 to a list using collect_list
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions

object GroupToArray {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().master("local[*]").appName("GroupToArray").getOrCreate()
    import spark.implicits._

    // Load your dataframe
    val df = List(
      ("AA", "11"),
      ("BB", "21"),
      ("AA", "12"),
      ("AA", "13"),
      ("BB", "22"),
      ("CC", "33")).toDF("col1", "col2")

    // Group by 'col1'
    df.groupBy("col1")
      // Aggregate col2 and combine it into a list
      .agg(functions.collect_list("col2").as("newColumn"))
      .show()
  }
}
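Running this produces the transformed dataset from the question (row order may vary across runs):

+----+------------+
|col1|   newColumn|
+----+------------+
|  AA|[11, 12, 13]|
|  BB|    [21, 22]|
|  CC|        [33]|
+----+------------+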
Source: https://stackoverflow.com/questions/62546988/spark-data-set-transformation-to-array