Custom aggregation on PySpark dataframes

遥遥无期 2021-01-05 01:20

I have a PySpark DataFrame in which one column holds one-hot encoded vectors. After a groupBy, I want to aggregate those vectors by element-wise vector addition.

e.g.
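A sketch of the input and the desired result, assuming illustrative column names id and features (the sample data is also an assumption):

```
Input                           Desired output after groupBy("id")
id | features                   id | summed_features
---+-----------------           ---+-----------------
 1 | [1.0, 0.0, 0.0]             1 | [1.0, 1.0, 0.0]
 1 | [0.0, 1.0, 0.0]             2 | [0.0, 0.0, 1.0]
 2 | [0.0, 0.0, 1.0]
```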

1 Answer
  • 2021-01-05 01:53

    You have several options:

    1. Create a user-defined aggregate function (UDAF). The drawback is that you have to write the UDAF in Scala and then wrap it for use from Python.
    2. Use the collect_list function to collect all vectors of a group into a list, then combine them with a UDF (a sketch follows after this list).
    3. Drop down to the RDD API and use aggregate or aggregateByKey (see the second sketch below).

    Both options 2 & 3 would be relatively inefficient (costing both cpu and memory).
