Apache Spark — Assign the result of UDF to multiple dataframe columns

后端未结

关注

 2  1948

I\'m using pyspark, loading a large csv file into a dataframe with spark-csv, and as a pre-processing step I need to apply a variety of operations to the data available in o

相关标签:

2条回答

余生分开走

2020-12-02 09:56

It is not possible to create multiple top level columns from a single UDF call but you can create a new struct. It requires an UDF with specified returnType:

from pyspark.sql.functions import udf
from pyspark.sql.types import *

schema = StructType([
    StructField("foo", FloatType(), False),
    StructField("bar", FloatType(), False)
])

def udf_test(n):
    return (n / 2, n % 2) if n and n != 0.0 else (float('nan'), float('nan'))

test_udf = udf(udf_test, schema)
df = sc.parallelize([(1, 2.0), (2, 3.0)]).toDF(["x", "y"])

foobars = df.select(test_udf("y").alias("foobar"))
foobars.printSchema()
## root
##  |-- foobar: struct (nullable = true)
##  |    |-- foo: float (nullable = false)
##  |    |-- bar: float (nullable = false)

You further flatten the schema with simple select:

foobars.select("foobar.foo", "foobar.bar").show()
## +---+---+
## |foo|bar|
## +---+---+
## |1.0|0.0|
## |1.5|1.0|
## +---+---+

See also Derive multiple columns from a single column in a Spark DataFrame

0 讨论(0)

囚心锁ツ

2020-12-02 09:56
you can use flatMap to get the column the desired dataframe in one go
```
df=df.withColumn('udf_results',udf)  
df4=df.select('udf_results').rdd.flatMap(lambda x:x).toDF(schema=your_new_schema)
```
0 讨论(0)
发布评论:

提交评论
- 加载中...