Stack Overflow while processing several columns with a UDF

情歌与酒 2020-12-10 08:17

I have a DataFrame with many columns of string type, and I want to apply a function to all of those columns without renaming them or adding more

1 answer
  • 2020-12-10 08:34

    Try something like this:

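    from pyspark.sql.functions import col, lower, trim

    # Build one expression per column: normalize string columns under
    # their original names, pass every other column through unchanged.
    exprs = [
        lower(trim(col(c))).alias(c) if t == "string" else col(c)
        for (c, t) in df.dtypes
    ]

    # Apply all expressions in a single projection.
    df = df.select(*exprs)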
    

    This approach has two main advantages over your current solution:

    • It requires only a single projection instead of one projection per string column, so the lineage does not grow (the growing lineage is most likely responsible for the stack overflow).
    • It operates directly on the internal representation without passing data to Python (no BatchEvalPython step in the plan).