pyspark approxQuantile function


粉色の甜心 2021-02-06 02:55

I have a DataFrame with the columns id, price, and timestamp.

I would like to find the median price for each id.

3 answers

再見小時候 (OP)

2021-02-06 03:43

If an aggregation works for you instead of a window function, you can also use a pandas_udf. It is slower than built-in Spark functions, though, because each group's data has to be serialized between the JVM and Python. Here is an example adapted from the docs:

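from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Reuse the active session, or create one if none exists
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "price")
)

# Grouped-aggregate pandas UDF: v is a pandas Series holding one group's prices.
# Spark 3.0+ prefers Python type hints over PandasUDFType, which still works.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def median_udf(v):
    return v.median()

# One row per id with the median price of that group
df.groupby("id").agg(median_udf(df["price"])).show()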
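To sanity-check the result without a cluster, here is a minimal sketch in plain pandas. It assumes pandas is installed, and the names pdf and medians are mine, not from the docs. A grouped-aggregate pandas UDF receives each group's column as a pandas Series, so the per-group values should match a pandas groupby median on the same rows:

```python
import pandas as pd

# Same sample rows as the Spark example above
pdf = pd.DataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], columns=["id", "price"]
)

# Group by id and take the median of price within each group
medians = pdf.groupby("id")["price"].median()
print(medians.to_dict())
```

For this data the median is 1.5 for id 1 and 5.0 for id 2, which is what the Spark job should show as well.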
