How to use the spark stats?

Posted by 我们两清 on 2020-05-17 06:54:10

Question


I'm using spark-sql 2.4.1 and I'm trying to find quantiles (percentile 0, percentile 25, etc.) for each column of my data.

Since I am computing multiple percentiles, how do I retrieve each calculated percentile from the results?

Here is an example, with data as shown below:

+----+---------+-------------+----------+-----------+
|  id|     date|total_revenue|con_dist_1| con_dist_2|
+----+---------+-------------+----------+-----------+
|3310|1/15/2018|  0.010680705|         6|0.019875458|
|3310|1/15/2018|  0.006628853|         4|0.816039063|
|3310|1/15/2018|   0.01378215|         4|0.082049528|
|3310|1/15/2018|  0.010680705|         6|0.019875458|
|3310|1/15/2018|  0.006628853|         4|0.816039063|
|3310|1/15/2018|   0.01378215|         4|0.082049528|
|3310|1/15/2018|  0.010680705|         6|0.019875458|
|3310|1/15/2018|  0.010680705|         6|0.019875458|
|3310|1/15/2018|  0.014933087|         5|0.034681906|
|3310|1/15/2018|  0.014448282|         3|0.082049528|
+----+---------+-------------+----------+-----------+

I need to calculate percentile 0, percentile 25, etc. on "con_dist_1", "con_dist_2", and so on.

I am doing the following for percentile 50:

val col_list = Array("con_dist_1","con_dist_2")
val median_col_list = partitioned_data.stat.approxQuantile(col_list, Array(0.5),0.0)
println(median_col_list)

It gives this result:

median_col_list: Array[Array[Double]] = Array(Array(4.0), Array(0.034681906))

How do I map the results? Is there any way to tell which result belongs to which column? Please suggest a better approach if there is one.


Answer 1:


To calculate multiple percentiles at the same time, you can simply add them to the array you pass to approxQuantile. For example, for percentiles 0, 25, 50, 75 and 100 you would do it as follows:

val col_list = Array("con_dist_1", "con_dist_2")
val percentiles = Array(0.0, 0.25, 0.5, 0.75, 1.0)
val median_col_list = partitioned_data.stat.approxQuantile(col_list, percentiles, 0.0)

The result will now be an array of arrays: one inner array per column, each holding the requested percentiles in the order they were given.

Knowing which column the percentiles correspond to is simple: it follows the order of col_list. In this case, median_col_list(0) corresponds to "con_dist_1" and median_col_list(1) to "con_dist_2". By the same logic, median_col_list(1)(2) is the 50th percentile for the "con_dist_2" column.
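
If looking values up by position feels error-prone, one option is to zip the column names and percentiles with the result so each value can be fetched by name. The following is only a sketch: the object QuantileLabels and its label helper are hypothetical names, not part of Spark, and the sample values are the 50th-percentile results shown in the question.

```scala
object QuantileLabels {
  // Hypothetical helper (not a Spark API): builds column -> (percentile -> value).
  // Assumes `results` has the shape returned by approxQuantile for several columns:
  // one inner array per column, one entry per requested probability, same order.
  def label(
      cols: Array[String],
      probs: Array[Double],
      results: Array[Array[Double]]
  ): Map[String, Map[Double, Double]] =
    cols.zip(results).map { case (name, values) =>
      name -> probs.zip(values).toMap
    }.toMap

  def main(args: Array[String]): Unit = {
    val colList = Array("con_dist_1", "con_dist_2")
    val probs = Array(0.5)
    // Values copied from the question's output for percentile 50.
    val results = Array(Array(4.0), Array(0.034681906))

    val labelled = label(colList, probs, results)
    println(labelled("con_dist_1")(0.5)) // prints 4.0
    println(labelled("con_dist_2")(0.5)) // prints 0.034681906
  }
}
```

With a real DataFrame, the third argument would presumably be the value returned by partitioned_data.stat.approxQuantile(col_list, percentiles, 0.0), with col_list and percentiles passed as the first two arguments so the ordering stays consistent.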



Source: https://stackoverflow.com/questions/60546150/how-to-use-the-spark-stats
