I have a dataframe with the columns id, price and timestamp.

I would like to find the median value, grouped by id.
Since an aggregate median function is missing for groups, I'm adding an example of constructing a function call by name (percentile_approx in this case). callUDF resolves any function registered in the session, including built-in SQL functions such as percentile_approx, so it can reach functions that have no PySpark wrapper:
from pyspark.sql.column import Column, _to_java_column, _to_seq

def from_name(sc, func_name, *params):
    """
    Create a Column from a function call built by name.
    """
    # JVM-side helper that resolves a function by its registered name
    callUDF = sc._jvm.org.apache.spark.sql.functions.callUDF
    # convert the Python Column arguments to a JVM Seq and build the call
    func = callUDF(func_name, _to_seq(sc, *params, _to_java_column))
    return Column(func)
Apply the percentile_approx function in a groupBy:
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# build the percentile_approx function call by name:
target = from_name(sc, "percentile_approx", [f.col("salary"), f.lit(0.95)])

# load dataframe of persons data
# with columns "person_id", "group_id" and "salary"
persons = spark.read.parquet( ... )

# apply the function to each group
persons.groupBy("group_id").agg(
    target.alias("target")).show()
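To answer the original question (the median of price per id), the same helper can be reused with a percentile of 0.5, since the median is the 50th percentile. A minimal sketch, assuming the question's dataframe is named df (a hypothetical name, not given in the question):

# median == 50th percentile, so build percentile_approx with 0.5
median_price = from_name(sc, "percentile_approx", [f.col("price"), f.lit(0.5)])

# df is assumed to hold the question's columns: id, price, timestamp
df.groupBy("id").agg(median_price.alias("median_price")).show()

Note that percentile_approx returns an approximate median. On Spark 3.1+ the function is exposed directly as pyspark.sql.functions.percentile_approx, and f.expr("percentile_approx(price, 0.5)") works on earlier versions too, so the call-by-name helper is only needed where neither option is available.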