I have a dataframe with these columns: id, price, timestamp.
I would like to find the median value grouped by id.
If you are fine with an aggregation instead of a window function, you can also use a pandas_udf. Pandas UDFs are not as fast as built-in Spark functions, though. Here is an example adapted from the docs:
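from pyspark.sql.functions import pandas_udf, PandasUDFType

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "price")
)

# Grouped-aggregate pandas UDF: receives each group's prices as a pandas Series.
# PandasUDFType is deprecated since Spark 3.0 in favor of Python type hints.
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def median_udf(v):
    return v.median()

df.groupby("id").agg(median_udf(df["price"])).show()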
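To sanity-check the result, here is a minimal sketch of the same aggregation in plain pandas, with no Spark involved. It reuses the sample rows from the example above. The names pdf and expected are only illustrative. Since median_udf just calls Series.median() on each group's prices, the Spark output should show the same per-id values.

```python
import pandas as pd

# Same sample rows as the Spark example above.
pdf = pd.DataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    columns=["id", "price"],
)

# Series.median() is the call median_udf makes on each group's prices.
expected = pdf.groupby("id")["price"].median()
print(expected)  # id 1 -> 1.5, id 2 -> 5.0
```

Note that this is an exact median. With an even number of rows, as for id 1, pandas averages the two middle values.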