Custom aggregation on PySpark dataframes

遥遥无期 2021-01-05 01:20

I have a PySpark DataFrame in which one column holds one-hot encoded vectors. After a groupBy, I want to aggregate those vectors by element-wise vector addition.

e.g.
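A sketch of the input and the desired result, assuming illustrative column names id and features (the sample data is also an assumption):

```
Input                           Desired output after groupBy("id")
id | features                   id | summed_features
---+-----------------           ---+-----------------
 1 | [1.0, 0.0, 0.0]             1 | [1.0, 1.0, 0.0]
 1 | [0.0, 1.0, 0.0]             2 | [0.0, 0.0, 1.0]
 2 | [0.0, 0.0, 1.0]
```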

1 Answer
  • 2021-01-05 01:53

    You have several options:

    1. Create a user-defined aggregate function (UDAF). The drawback is that you have to write the UDAF in Scala and then wrap it for use from Python.
    2. Use the collect_list function to collect all vectors of a group into a list, then combine them with a UDF (a sketch follows after this list).
    3. Drop down to the RDD API and use aggregate or aggregateByKey (see the second sketch below).

    Both options 2 & 3 would be relatively inefficient (costing both cpu and memory).
