PySpark: passing a DataFrame to a pandas_udf and returning a Series

没有蜡笔的小新 2021-01-07 15:08

I'm using PySpark's new pandas_udf decorator and I'm trying to get it to take multiple columns as input and return a Series as output. However, I get a

1 Answer
  • 2021-01-07 15:34

    A SCALAR pandas_udf expects pandas Series as input, not a DataFrame. In your case there is no need for a UDF at all. Computing directly from columns a, b, c after clipping should work:

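    import pyspark.sql.functions as f
    
    # Assumes an active SparkSession named `spark`.
    df = spark.createDataFrame([[1,2,4],[-1,2,2]], ['a', 'b', 'c'])
    
    # Zero out a value whenever column a is negative in that row.
    clip = lambda x: f.when(df.a < 0, 0).otherwise(x)
    df.withColumn('d', (clip(df.a) - clip(df.b)) / clip(df.c)).show()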
    
    #+---+---+---+-----+
    #|  a|  b|  c|    d|
    #+---+---+---+-----+
    #|  1|  2|  4|-0.25|
    #| -1|  2|  2| null|
    #+---+---+---+-----+
    

    If you do have to use a pandas_udf, the return type needs to be double, not df.schema, because you return a pandas Series, not a pandas DataFrame. You also need to pass the columns into the function as Series rather than passing the whole DataFrame:

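    from pyspark.sql.functions import pandas_udf, PandasUDFType
    
    @pandas_udf('double', PandasUDFType.SCALAR)
    def fun_function(a, b, c):
        # a, b, c arrive as pandas Series; zero every value in rows where a < 0.
        clip = lambda x: x.where(a >= 0, 0)
        return (clip(a) - clip(b)) / clip(c)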
    
    df.withColumn('d', fun_function(df.a, df.b, df.c)).show()
    #+---+---+---+-----+
    #|  a|  b|  c|    d|
    #+---+---+---+-----+
    #|  1|  2|  4|-0.25|
    #| -1|  2|  2| null|
    #+---+---+---+-----+
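
To see what the body of the UDF does without starting Spark, here is a minimal pandas-only sketch. The three Series are hypothetical stand-ins for the column batches Spark would hand to the function, built from the same two rows as the example DataFrame; `clip` is written as a named function here only for readability.

```python
import pandas as pd

# Hypothetical stand-ins for the column batches Spark would pass to the UDF.
a = pd.Series([1, -1])
b = pd.Series([2, 2])
c = pd.Series([4, 2])

def clip(x):
    # Series.where keeps x where the condition holds (a >= 0)
    # and substitutes 0 in the rows where it does not (a < 0).
    return x.where(a >= 0, 0)

d = (clip(a) - clip(b)) / clip(c)
# First row: (1 - 2) / 4 = -0.25. Second row: (0 - 0) / 0, which pandas reports as NaN.
print(d.tolist())
```

The second row is the one that appears as null in the Spark output above. In Spark 3.0 and later, the same UDF can be declared with Python type hints (`pd.Series` arguments and return type) instead of `PandasUDFType.SCALAR`, which is deprecated there.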
    