Apache Spark — Assign the result of UDF to multiple dataframe columns

面向向阳花 2020-12-02 09:12

I'm using PySpark, loading a large CSV file into a DataFrame with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in o

2 Answers
  • 2020-12-02 09:56

    A single UDF call cannot create multiple top-level columns, but it can return a struct whose fields you then select. This requires a UDF with an explicit returnType:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, FloatType
    
    schema = StructType([
        StructField("foo", FloatType(), False),
        StructField("bar", FloatType(), False)
    ])
    
    def udf_test(n):
        # Return (half, remainder) as floats; None or zero input maps to a NaN pair.
        return (n / 2, n % 2) if n else (float('nan'), float('nan'))
    
    test_udf = udf(udf_test, schema)
    # Assumes an active SparkContext named sc, as in the pyspark shell.
    df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])
    
    foobars = df.select(test_udf("y").alias("foobar"))
    foobars.printSchema()
    ## root
    ##  |-- foobar: struct (nullable = true)
    ##  |    |-- foo: float (nullable = false)
    ##  |    |-- bar: float (nullable = false)
    

    You can then flatten the schema with a simple select:

    foobars.select("foobar.foo", "foobar.bar").show()
    ## +---+---+
    ## |foo|bar|
    ## +---+---+
    ## |1.0|0.0|
    ## |1.5|1.0|
    ## +---+---+
    

    See also "Derive multiple columns from a single column in a Spark DataFrame".
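To keep the original columns next to the new fields, one option is to attach the struct with withColumn and then expand it with a star selection. The following is a minimal sketch, assuming a DataFrame like df above with a float column y. The names half_and_remainder and with_foo_bar are hypothetical and introduced only for illustration; the row logic mirrors udf_test.

```python
def half_and_remainder(n):
    """Row-level logic: (n / 2, n % 2), or a NaN pair for None or zero input."""
    if not n:
        return (float("nan"), float("nan"))
    return (n / 2, n % 2)


def with_foo_bar(df, source_col="y"):
    """Append 'foo' and 'bar' columns derived from source_col, keeping existing columns.

    Assumes source_col holds floats, so both results fit FloatType.
    """
    # Spark imports are local so half_and_remainder can be tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, FloatType

    schema = StructType([
        StructField("foo", FloatType(), False),
        StructField("bar", FloatType(), False),
    ])
    struct_udf = udf(half_and_remainder, schema)

    return (
        df.withColumn("foobar", struct_udf(source_col))  # one struct column
          .select("*", "foobar.*")                       # expand its fields
          .drop("foobar")                                # remove the struct itself
    )
```

Calling with_foo_bar(df).show() should then list x, y, foo and bar side by side. Keeping the row logic in a plain function also makes it easy to check outside Spark.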

  • 2020-12-02 09:56

    You can use flatMap to get the desired DataFrame in one go:

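    # my_udf and 'input_col' are placeholders for a UDF with a StructType return type and its input column.
    df = df.withColumn('udf_results', my_udf('input_col'))
    df4 = df.select('udf_results').rdd.flatMap(lambda x: x).toDF(schema=your_new_schema)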
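
Why this works: after select('udf_results') every row holds exactly one value, the struct returned by the UDF, so flatMap(lambda x: x) iterates over that outer row and emits the inner struct as a row of its own. Below is a minimal plain-Python sketch of that unwrapping, with tuples standing in for Spark Row objects and a hypothetical helper flat_map imitating RDD.flatMap. It is an illustration, not Spark code.

```python
from itertools import chain


def flat_map(func, rows):
    """Imitates RDD.flatMap: apply func to each element and concatenate the results."""
    return list(chain.from_iterable(func(row) for row in rows))


# Each outer tuple is one row with a single column holding the (foo, bar) struct.
single_column_rows = [((1.0, 0.0),), ((1.5, 1.0),)]

# Iterating over an outer row yields its only element, the inner struct.
unwrapped = flat_map(lambda row: row, single_column_rows)
print(unwrapped)  # [(1.0, 0.0), (1.5, 1.0)]
```

If staying in the DataFrame API is preferable, df.select('udf_results.*') expands the struct fields without the round trip through an RDD.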
    