I have a dataframe with a very large number of columns (>30000). I'm filling it with 1 and 0 based on the first column like this:
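Simplified, the loop has this shape (the real column names and the exact condition in my code differ; array_contains over the first column is just one way to write the membership test, and first_col is a placeholder name):

from pyspark.sql.functions import array_contains, when

# For every target column: 1 if the column name appears in the array held
# by the first column of the row, 0 otherwise.
for column in list_of_column_names:
    df = df.withColumn(
        column,
        when(array_contains(df["first_col"], column), 1).otherwise(0),
    )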
There is nothing specifically wrong with your code, other than very wide data:
for column in list_of_column_names:
    df = df.withColumn(...)
only generates the execution plan.
Actual data processing will be concurrent and parallelized, once the result is evaluated.
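A minimal way to see that split, assuming df and list_of_column_names as above and using lit(0) as a stand-in for the real expression:

from pyspark.sql.functions import lit

# Each withColumn call only extends the logical plan; no data is touched yet.
for column in list_of_column_names:
    df = df.withColumn(column, lit(0))  # placeholder for the real 1/0 expression

# Work starts only when an action forces evaluation of that plan.
df.count()  # or df.write.parquet(...), df.collect(), ...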
It is, however, an expensive process, as it requires O(NMK) operations for N rows, M columns and K values in the list (at, say, 10^6 rows, 30,000 columns and 10 values per list, that is on the order of 3 × 10^11 checks).
Additionally, execution plans on very wide data are very expensive to compute (though the cost is constant in terms of the number of records). If that becomes the limiting factor, you might be better off dropping down to RDDs, for example preprocessing the first column with the sort_array function and then mapping over the RDD directly.
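For completeness, a rough sketch of what an RDD-based variant could look like, assuming the first column holds an array of values and using a plain Python set for the membership test rather than sort_array (the names add_indicators and result are made up here):

# One map over the underlying RDD builds all indicator values at once, so the
# resulting plan stays a single flat scan instead of thousands of nested projections.
column_names = list_of_column_names

def add_indicators(row):
    present = set(row[0])  # values stored in the first (array) column
    return tuple(row) + tuple(1 if name in present else 0 for name in column_names)

result = df.rdd.map(add_indicators).toDF(df.columns + column_names)

The trade-off is losing the DataFrame optimizer for this step, but the plan no longer grows with the number of added columns.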