pyspark approxQuantile function

粉色の甜心 2021-02-06 02:55

I have a dataframe with the columns id, price, and timestamp.

I would like to find the median value of price grouped by id.

I
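For context, DataFrame.approxQuantile (named in the title) computes quantiles over the whole DataFrame rather than per group, so it cannot directly give a median per id. A minimal sketch, assuming the column names from the question and made-up values:

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()
    
    # small example frame with the columns from the question (values are made up)
    df = spark.createDataFrame(
        [(1, 10.0, 1612300000), (1, 20.0, 1612300060), (2, 5.0, 1612300120)],
        ["id", "price", "timestamp"],
    )
    
    # global (ungrouped) approximate median of price; 0.01 is the relative error
    print(df.approxQuantile("price", [0.5], 0.01))  # e.g. [10.0]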

3 Answers
  •  情歌与酒
    2021-02-06 03:25

    Example: calculating quantiles within groups (as an aggregate)

    Since an aggregate version of the quantile function is missing for groups in the PySpark API, I'm adding an example of constructing a function call by name (percentile_approx in this case):

    from pyspark.sql.column import Column, _to_java_column, _to_seq
    
    def from_name(sc, func_name, *params):
        """
           create call by function name 
        """
        callUDF = sc._jvm.org.apache.spark.sql.functions.callUDF
        func = callUDF(func_name, _to_seq(sc, *params, _to_java_column))
        return Column(func)
    

    Apply the percentile_approx function in a groupBy:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as f
    
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext
    
    # build percentile_approx function call by name: 
    target = from_name(sc, "percentile_approx", [f.col("salary"), f.lit(0.95)])
    
    
    # load dataframe for persons data 
    # with columns "person_id", "group_id" and "salary"
    persons = spark.read.parquet( ... )
    
    # apply function for each group
    persons.groupBy("group_id").agg(
        target.alias("target")).show()
    
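    Note that on Spark 3.1+ the same aggregate is exposed directly as pyspark.sql.functions.percentile_approx, so the call-by-name helper is not needed there. A minimal sketch, assuming that Spark version and the id/price columns from the question:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    
    spark = SparkSession.builder.getOrCreate()
    
    df = spark.createDataFrame(
        [(1, 10.0), (1, 20.0), (2, 5.0), (2, 7.0)],
        ["id", "price"],
    )
    
    # Spark 3.1+: percentile_approx is a built-in aggregate; 0.5 gives the median
    df.groupBy("id").agg(
        F.percentile_approx("price", 0.5).alias("median_price")
    ).show()
    
    # On older versions, the same SQL aggregate can usually be reached through expr:
    # df.groupBy("id").agg(F.expr("percentile_approx(price, 0.5)").alias("median_price"))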
