How to calculate the percentile of a column in a DataFrame in Spark?


I am trying to calculate the percentile of a column in a DataFrame, but I can't find any percentile_approx function among Spark's aggregation functions.

For example, in Hive we have percentile_approx.
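For context, the Hive UDAF the question refers to can be reached from Spark through a HiveContext (Spark 1.x). A hedged sketch — the table and column names are made up for illustration:

```scala
import org.apache.spark.sql.hive.HiveContext

// Assumes an existing SparkContext `sc` (spark-shell provides one).
val hiveCtx = new HiveContext(sc)
import hiveCtx.implicits._

// Illustrative data: a table `people` with a numeric column `age`
sc.parallelize(1 to 100).toDF("age").registerTempTable("people")

// Hive's percentile_approx UDAF is available through the SQL interface
val median = hiveCtx.sql(
  "SELECT percentile_approx(age, 0.5) AS median FROM people")
median.show()
```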

2 Answers
  • 2021-02-20 09:24

    SparkSQL and the Scala DataFrame/Dataset APIs are executed by the same engine. Equivalent operations generate equivalent execution plans, which you can inspect with explain:

    sql(...).explain
    df.explain
    

    When it comes to your specific question, it is a common pattern to intermix SparkSQL and Scala DSL syntax because, as you have discovered, their capabilities are not yet equivalent. (Another example is the difference between SQL's explode() and the DSL's explode(); the latter is more powerful, but also less efficient due to marshalling.)
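To make the explode() contrast concrete, a hedged sketch using the pre-2.0 API — the column names and sample function are made up:

```scala
import org.apache.spark.sql.functions.explode

// SQL-flavoured explode: operates on an array column entirely inside the
// engine (assumes `df` has an array column `items` and sqlContext.implicits._
// is in scope for the $ interpolator)
val words = df.select(explode($"items").as("word"))

// DSL explode (deprecated in 2.0): takes an arbitrary Scala function, which
// is where the marshalling overhead mentioned above comes from
val dslWords = df.explode("line", "word") { line: String => line.split(" ").toSeq }
```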

    The simple way to do it is as follows:

    df.registerTempTable("tmp_tbl")
    val newDF = sql(/* do something with tmp_tbl */)
    // Continue using newDF with Scala DSL
    

    What you need to keep in mind if you go with the simple way is that temporary table names are cluster-global (up to 1.6.x). Therefore, you should use randomized table names if the code may run simultaneously more than once on the same cluster.
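One way to implement the randomized-name advice — the helper name and UUID scheme below are my own, not from the answer:

```scala
import java.util.UUID
import org.apache.spark.sql.DataFrame

// Register `df` under a per-call unique name so concurrent runs on the same
// cluster cannot collide (temp table names are global up to 1.6.x), and
// unregister it afterwards.
def withRandomTempTable[T](df: DataFrame)(body: String => T): T = {
  val name = s"tmp_${UUID.randomUUID().toString.replace("-", "")}"
  df.registerTempTable(name)
  try body(name)
  finally df.sqlContext.dropTempTable(name)
}

// Usage:
// val newDF = withRandomTempTable(df) { t => sql(s"SELECT ... FROM $t") }
```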

    On my team this pattern is common enough that we have added a .sql() implicit to DataFrame which automatically registers and then unregisters a temp table for the scope of the SQL statement.
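The answer does not show that implicit's implementation; one possible shape, with illustrative names:

```scala
import org.apache.spark.sql.DataFrame

object SqlSyntax {
  implicit class RichDataFrame(val df: DataFrame) extends AnyVal {
    // Run `query` with this DataFrame registered as `alias`, then unregister.
    // Analysis of the returned DataFrame is eager, so dropping the temp
    // table immediately afterwards is safe.
    def sql(query: String, alias: String = "this_df"): DataFrame = {
      df.registerTempTable(alias)
      try df.sqlContext.sql(query)
      finally df.sqlContext.dropTempTable(alias)
    }
  }
}
```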

  • 2021-02-20 09:24

    Since Spark 2.0 things have become easier: simply use approxQuantile from DataFrameStatFunctions:

    df.stat.approxQuantile("Open_Rate", Array(0.25, 0.50, 0.75), 0.0)

    There are also other useful statistical functions for DataFrames in DataFrameStatFunctions.
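A runnable sketch of the call above — the local-mode session and column data are made up; the last argument is the relative error, where 0.0 means an exact (but more expensive) computation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("quantile-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = (1 to 100).toDF("Open_Rate")

// 25th, 50th and 75th percentiles, computed exactly (relativeError = 0.0)
val Array(q1, median, q3) =
  df.stat.approxQuantile("Open_Rate", Array(0.25, 0.50, 0.75), 0.0)
println(s"q1=$q1 median=$median q3=$q3")
```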
