pyspark approxQuantile function

粉色の甜心 · 2021-02-06 02:55

I have a DataFrame with the columns id, price, and timestamp.

I would like to find the median price grouped by id.

I know approxQuantile, but as far as I can tell it computes quantiles over the whole DataFrame rather than per group.
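
For reference, a minimal sketch of that behavior (assuming an active SparkSession named spark):

    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "price")
    )

    # approxQuantile(col, probabilities, relativeError); relativeError=0.0
    # requests the exact quantile at extra cost. It returns one value per
    # probability, computed over ALL rows, so there is no per-id median here.
    df.approxQuantile("price", [0.5], 0.0)  # -> [3.0]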

3 Answers
  •  再見小時候
    2021-02-06 03:43

    If you are fine with aggregation instead of a window function, there is also the option to use a pandas_udf. It is not as fast as pure Spark, though. Here is an adapted example from the docs:

    from pyspark.sql.functions import pandas_udf, PandasUDFType
    
    df = spark.createDataFrame(
        [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "price")
    )
    
    # Grouped-aggregate pandas UDF: each group's prices arrive as a single
    # pandas Series, and the UDF must reduce them to one scalar.
    # (Spark 3+ prefers the type-hint style -- def median_udf(v: pd.Series)
    # -> float -- but this form from the 2.x docs still works.)
    @pandas_udf("double", PandasUDFType.GROUPED_AGG)
    def median_udf(v):
        return v.median()
    
    df.groupby("id").agg(median_udf(df["price"])).show()
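
    For a pure-Spark alternative, percentile_approx is a native aggregate. A sketch, assuming Spark 3.1+ (where it is exposed in pyspark.sql.functions; older versions can reach it through F.expr), and assuming your version accepts it over a window:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window
    
    # Native aggregate: avoids the Python (de)serialization cost of a pandas_udf.
    df.groupby("id").agg(
        F.percentile_approx("price", 0.5).alias("median_price")
    ).show()
    
    # The same aggregate over a window keeps every row and attaches the
    # per-id median as an extra column instead of collapsing the groups.
    df.withColumn(
        "median_price",
        F.percentile_approx("price", 0.5).over(Window.partitionBy("id")),
    ).show()

    (Spark 3.4+ also exposes F.median for an exact grouped median.)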
    
