Stack Overflow while processing several columns with a UDF

I have a DataFrame with many columns of string type, and I want to apply a function to all of those columns, without renaming them or adding more

1 Answer

你的背包

2020-12-10 08:34
Try something like this:
```
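from pyspark.sql.functions import col, lower, trim

# Build one expression per column: normalize string columns, pass others through.
exprs = [
    lower(trim(col(c))).alias(c) if t == "string" else col(c)
    for (c, t) in df.dtypes
]

# A single projection over all columns; returns a new DataFrame.
df.select(*exprs)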
```
This approach has two main advantages over your current solution:
- it requires only a single projection instead of one projection per string column (no growing lineage, which is most likely responsible for the stack overflow).
- it operates directly on the internal representation without passing data to Python (BatchPythonProcessing).
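
If you want to reuse this, it could be wrapped roughly as below. This is only a sketch: `string_columns` and `normalize_strings` are made-up names for illustration, and it assumes `df` is an ordinary PySpark DataFrame whose `dtypes` reports string columns as `"string"`.

```python
def string_columns(dtypes):
    """Return the names of columns whose Spark type is 'string'.

    `dtypes` is a list of (name, type) pairs, as returned by `df.dtypes`.
    """
    return [name for name, dtype in dtypes if dtype == "string"]


def normalize_strings(df):
    """Lower-case and trim every string column in one projection.

    Hypothetical helper, not part of PySpark. Column names and order are kept.
    """
    # Imported here so that string_columns() can be used without Spark installed.
    from pyspark.sql.functions import col, lower, trim

    targets = set(string_columns(df.dtypes))
    exprs = [
        lower(trim(col(c))).alias(c) if c in targets else col(c)
        for c in df.columns
    ]
    # One select call, so the plan gets a single projection however many
    # string columns there are.
    return df.select(*exprs)
```

Calling `normalize_strings(df).explain()` should then show a single projection and no Python evaluation step, which is a quick way to check both points above.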