pyspark approxQuantile function


Question


I have a dataframe with the columns id, price, and timestamp.

I would like to find the median value of price, grouped by id.

I am using this code to find it, but it's giving me this error:

from pyspark.sql import DataFrameStatFunctions as statFunc
windowSpec = Window.partitionBy("id")
median = statFunc.approxQuantile("price",
                                 [0.5],
                                 0) \
                 .over(windowSpec)

return df.withColumn("Median", median)

Is it not possible to use DataFrameStatFunctions to fill values in a new column?

TypeError: unbound method approxQuantile() must be called with DataFrameStatFunctions instance as first argument (got str instance instead)

Answer 1:


Well, indeed it is not possible to use approxQuantile to fill values in a new dataframe column, but this is not why you are getting this error. Unfortunately, the whole story underneath is a rather frustrating one, as I have argued is the case with many Spark (especially PySpark) features and their lack of adequate documentation.

To start with, there are not one but two approxQuantile methods; the first one is part of the standard DataFrame class, i.e. you don't need to import DataFrameStatFunctions:

spark.version
# u'2.1.1'

sampleData = [("bob","Developer",125000),("mark","Developer",108000),("carl","Tester",70000),("peter","Developer",185000),("jon","Tester",65000),("roman","Tester",82000),("simon","Developer",98000),("eric","Developer",144000),("carlos","Tester",75000),("henry","Developer",110000)]

df = spark.createDataFrame(sampleData, schema=["Name","Role","Salary"])
df.show()
# +------+---------+------+ 
# |  Name|     Role|Salary|
# +------+---------+------+
# |   bob|Developer|125000| 
# |  mark|Developer|108000|
# |  carl|   Tester| 70000|
# | peter|Developer|185000|
# |   jon|   Tester| 65000|
# | roman|   Tester| 82000|
# | simon|Developer| 98000|
# |  eric|Developer|144000|
# |carlos|   Tester| 75000|
# | henry|Developer|110000|
# +------+---------+------+

med = df.approxQuantile("Salary", [0.5], 0.25) # no need to import DataFrameStatFunctions
med
# [98000.0]
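
For reference, the full signature is approxQuantile(col, probabilities, relativeError), and you can request several probabilities in a single pass; a minimal sketch on the same toy dataframe:

quartiles = df.approxQuantile("Salary", [0.25, 0.5, 0.75], 0.0)  # relativeError=0.0 asks for exact quantiles
quartiles
# a list with one value per requested probability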

The second one is part of DataFrameStatFunctions, but if you use it as you do, you get the error you report:

from pyspark.sql import DataFrameStatFunctions as statFunc
med2 = statFunc.approxQuantile( "Salary", [0.5], 0.25)
# TypeError: unbound method approxQuantile() must be called with DataFrameStatFunctions instance as first argument (got str instance instead)

because the correct usage is:

med2 = statFunc(df).approxQuantile( "Salary", [0.5], 0.25)
med2
# [82000.0]

although you won't find a simple example of this in the PySpark documentation (it took me some time to figure it out myself)... The best part? The two values are not equal:

med == med2
# False

I suspect this is due to the non-deterministic nature of the approximation algorithm (after all, it is supposed to be an approximate median, and the relative error of 0.25 passed as the third argument is rather generous); even if you re-run the commands with the same toy data you may get different values (and different from the ones I report here). I suggest experimenting a little to get a feel for it...
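
One way to get a deterministic sanity check (a minimal sketch, reusing the df and statFunc names from above): with a relative error of 0.0 the quantile computation is exact, so the two call styles should return the same value:

df.approxQuantile("Salary", [0.5], 0.0) == statFunc(df).approxQuantile("Salary", [0.5], 0.0)
# should print True - both routes delegate to the same underlying method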

But, as I already said, this is not the reason why you cannot use approxQuantile to fill values in a new dataframe column - even if you use the correct syntax, you will get a different error:

df2 = df.withColumn('median_salary', statFunc(df).approxQuantile( "Salary", [0.5], 0.25))
# AssertionError: col should be Column

Here, col refers to the second argument of the withColumn operation, i.e. the result of approxQuantile, and the error message says that it is not of type Column; indeed, it is a list:

type(statFunc(df).approxQuantile( "Salary", [0.5], 0.25))
# list

So, when filling column values, Spark expects arguments of type Column, and you cannot use lists; here is an example of creating a new column with mean values per Role instead of median ones:

import pyspark.sql.functions as func
from pyspark.sql import Window

windowSpec = Window.partitionBy(df['Role'])
df2 = df.withColumn('mean_salary', func.mean(df['Salary']).over(windowSpec))
df2.show()
# +------+---------+------+------------------+
# |  Name|     Role|Salary|       mean_salary| 
# +------+---------+------+------------------+
# |  carl|   Tester| 70000|           73000.0| 
# |   jon|   Tester| 65000|           73000.0|
# | roman|   Tester| 82000|           73000.0|
# |carlos|   Tester| 75000|           73000.0|
# |   bob|Developer|125000|128333.33333333333|
# |  mark|Developer|108000|128333.33333333333| 
# | peter|Developer|185000|128333.33333333333| 
# | simon|Developer| 98000|128333.33333333333| 
# |  eric|Developer|144000|128333.33333333333|
# | henry|Developer|110000|128333.33333333333| 
# +------+---------+------+------------------+

which works because, contrary to approxQuantile, mean returns a Column:

type(func.mean(df['Salary']).over(windowSpec))
# pyspark.sql.column.Column
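
As a side note, if all you need is the overall (not per-group) approximate median attached as a constant column, you can wrap the plain Python value returned by approxQuantile in a literal Column; a minimal sketch reusing the df and func names from above (df3 is just an illustrative name):

overall_med = df.approxQuantile("Salary", [0.5], 0.25)[0]             # a plain Python float, not a Column
df3 = df.withColumn("overall_median_salary", func.lit(overall_med))   # func.lit() turns it into a Column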



Answer 2:


Example: calculating quantiles in groups (aggregated)

Since an aggregate quantile function for groups is missing, I'm adding an example of constructing a function call by name (percentile_approx in this case):

from pyspark.sql.column import Column, _to_java_column, _to_seq

def from_name(sc, func_name, *params):
    """
    Build a Column that calls a registered SQL function by name.
    """
    # resolve the function by name on the JVM side
    callUDF = sc._jvm.org.apache.spark.sql.functions.callUDF
    # convert the Python Columns to a Java Seq[Column] and wrap the resulting expression
    func = callUDF(func_name, _to_seq(sc, *params, _to_java_column))
    return Column(func)

Apply the percentile_approx function in a groupBy:

from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# build percentile_approx function call by name: 
target = from_name(sc, "percentile_approx", [f.col("salary"), f.lit(0.95)])


# load dataframe for persons data 
# with columns "person_id", "group_id" and "salary"
persons = spark.read.parquet( ... )

# apply function for each group
persons.groupBy("group_id").agg(
    target.alias("target")).show()


Source: https://stackoverflow.com/questions/45287832/pyspark-approxquantile-function
