Efficient column processing in PySpark

死守一世寂寞 2021-01-15 13:46

I have a dataframe with a very large number of columns (>30000).

I'm filling it with 1 and 0 based on the first column, like this:
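Something along these lines, assuming the first column holds an array of values and each remaining column is set to 1 when its name occurs in that array (column names here are hypothetical):

    from pyspark.sql import functions as F

    # Hypothetical reconstruction: flag each of the >30000 columns
    # by membership of its name in the array held by the first column.
    first_col = df.columns[0]
    list_of_column_names = df.columns[1:]

    for column in list_of_column_names:
        df = df.withColumn(
            column,
            F.when(F.array_contains(F.col(first_col), column), 1).otherwise(0)
        )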

3 Answers

    说谎 2021-01-15 14:07

    There is nothing specifically wrong with your code, other than very wide data:

    for column in list_of_column_names:
        df = df.withColumn(...)
    

    only generates the execution plan.

    Actual data processing will be concurrent and parallelized once the result is evaluated.
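    For instance, nothing runs until an action is called; a minimal, self-contained illustration (the toy data and column names below are made up):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy stand-in for the wide frame: one array column, plus flag columns to fill.
    df = spark.createDataFrame([(["a", "c"],), (["b"],)], ["tags"])

    for column in ["a", "b", "c"]:  # stand-in for list_of_column_names
        # Transformation only: each call just extends the lazy execution plan.
        df = df.withColumn(column, F.array_contains("tags", column).cast("int"))

    df.explain()  # the plan built so far; no data has been processed yet
    df.show()     # an action: only now does Spark execute the plan in parallel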

    That processing is, however, expensive, as it requires O(NMK) operations for N rows, M columns and K values in the list.

    Additionally, execution plans on very wide data are very expensive to compute (though the cost is constant in the number of records). If this becomes the limiting factor, you might be better off with RDDs (see the sketch after this list):

    • Sort the array of values using the sort_array function.
    • Convert the data to an RDD.
    • Compute each column with a binary search over the sorted array.
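
    A rough sketch of those three steps; the helper function and the assumption that the first column holds the array are illustrative, not prescriptive:

    from bisect import bisect_left
    from pyspark.sql import functions as F

    first_col = df.columns[0]        # assumed to hold the array of values
    column_names = df.columns[1:]    # the ~30000 flag columns

    # 1. Sort the array once per row so it can be binary-searched later.
    sorted_df = df.withColumn(first_col, F.sort_array(F.col(first_col)))

    def contains(sorted_values, value):
        # Binary search in an already sorted list.
        i = bisect_left(sorted_values, value)
        return i < len(sorted_values) and sorted_values[i] == value

    # 2. + 3. Convert to an RDD and compute every flag with a binary search,
    # bringing the per-row work down from O(M * K) to O(M * log K).
    encoded = sorted_df.rdd.map(
        lambda row: [row[0]] + [1 if contains(row[0], c) else 0 for c in column_names]
    )

    result = encoded.toDF([first_col] + list(column_names))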
