How to calculate median of a numeric sequence in Google BigQuery efficiently?

Submitted by 强颜欢笑 on 2019-12-22 04:51:07

Question


I need to calculate the median of a numeric sequence in Google BigQuery efficiently. Is that possible?


Answer 1:


Yes, it's possible with the PERCENTILE_CONT window function. Per the documentation:

Returns values that are based upon linear interpolation between the values of the group, after ordering them per the ORDER BY clause.

The percentile argument must be between 0 and 1.

This window function requires ORDER BY in the OVER clause.

An example query looks like this (the max() is only there to collapse each group to a single row under GROUP BY; it plays no part in the median calculation, so don't let it confuse you):

SELECT room,
       max(median)
FROM
  (SELECT room,
          percentile_cont(0.5) OVER (PARTITION BY room
                                     ORDER BY temperature) AS median
   FROM
     (SELECT 1 AS room, 11 AS temperature),
     (SELECT 1 AS room, 12 AS temperature),
     (SELECT 1 AS room, 14 AS temperature),
     (SELECT 1 AS room, 19 AS temperature),
     (SELECT 1 AS room, 13 AS temperature),
     (SELECT 2 AS room, 20 AS temperature),
     (SELECT 2 AS room, 21 AS temperature),
     (SELECT 2 AS room, 29 AS temperature),
     (SELECT 3 AS room, 30 AS temperature))
GROUP BY room

This returns:

+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+
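
The query above is written in Legacy SQL. As a rough sketch of how the same calculation might look in Standard SQL: there PERCENTILE_CONT takes the value expression as its first argument and the OVER clause carries no ORDER BY. The CTE name `readings` and the use of SELECT DISTINCT in place of the max() / GROUP BY step are choices made for this illustration, not part of the original answer.

```sql
#standardSQL
-- Sample data from the answer above, as an inline array of structs.
WITH readings AS (
  SELECT * FROM UNNEST([
    STRUCT(1 AS room, 11 AS temperature),
    (1, 12), (1, 14), (1, 19), (1, 13),
    (2, 20), (2, 21), (2, 29),
    (3, 30)
  ])
)
-- The window function repeats the median on every row of a room,
-- so DISTINCT collapses the result to one row per room.
SELECT DISTINCT
  room,
  PERCENTILE_CONT(temperature, 0.5) OVER (PARTITION BY room) AS median
FROM readings
ORDER BY room
```

On this sample data the medians should again be 13, 21 and 30, returned as floating-point values because PERCENTILE_CONT interpolates.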



Answer 2:


An alternative, when you don't need exact results and an approximation is fine: combine the NTH and QUANTILES aggregate functions. This scales much better than analytic window functions, at the cost of giving approximate results. QUANTILES(temperature, 101) returns 101 values (the minimum, the 1st through 99th percentiles, and the maximum), and NTH counts from 1, so the median is element 51.

SELECT room,
       NTH(51, QUANTILES(temperature, 101))
FROM
  (SELECT 1 AS room, 11 AS temperature),
  (SELECT 1 AS room, 12 AS temperature),
  (SELECT 1 AS room, 14 AS temperature),
  (SELECT 1 AS room, 19 AS temperature),
  (SELECT 1 AS room, 13 AS temperature),
  (SELECT 2 AS room, 20 AS temperature),
  (SELECT 2 AS room, 21 AS temperature),
  (SELECT 2 AS room, 29 AS temperature),
  (SELECT 3 AS room, 30 AS temperature)
GROUP BY room

This returns:

+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+
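
For comparison, here is a minimal sketch of the same approximate approach in Standard SQL, where APPROX_QUANTILES takes over the role of QUANTILES and array indexing replaces NTH. APPROX_QUANTILES(x, n) returns n + 1 boundaries, so asking for 2 buckets yields the minimum, the approximate median and the maximum. The names `readings` and `approx_median` are made up for this example.

```sql
#standardSQL
-- Sample data from the answer above, as an inline array of structs.
WITH readings AS (
  SELECT * FROM UNNEST([
    STRUCT(1 AS room, 11 AS temperature),
    (1, 12), (1, 14), (1, 19), (1, 13),
    (2, 20), (2, 21), (2, 29),
    (3, 30)
  ])
)
SELECT
  room,
  -- APPROX_QUANTILES(x, 2) returns [min, approximate median, max];
  -- OFFSET is zero-based, so OFFSET(1) picks the middle element.
  APPROX_QUANTILES(temperature, 2)[OFFSET(1)] AS approx_median
FROM readings
GROUP BY room
ORDER BY room
```

Because this is a plain aggregate, it needs no window function and no extra step to remove duplicate rows. The result is still an approximation, so on large inputs it may differ slightly from the exact median.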



Answer 3:


2018 update with more metrics:

BigQuery SQL: Average, geometric mean, remove outliers, median


For my own reference, here are working queries against the NYC taxi data:

Approximate quantiles:

SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1

This gives the same results as PERCENTILE_DISC:

SELECT month, FIRST(median) median
FROM (
  SELECT MONTH(pickup_datetime) month,
         tip_amount,
         PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
  FROM [nyc-tlc:green.trips_2015]
  WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1

Standard SQL:

#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
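
If an exact value is needed in Standard SQL rather than an approximation, the PERCENTILE_DISC query above can probably be ported along the following lines. This is a sketch, not a tested replacement: it assumes pickup_datetime is a TIMESTAMP (so EXTRACT(MONTH FROM ...) stands in for the Legacy SQL MONTH() function), and it uses SELECT DISTINCT instead of FIRST() / GROUP BY to keep one row per month.

```sql
#standardSQL
SELECT DISTINCT
  month,
  -- Standard SQL form: value expression first, then the percentile;
  -- the OVER clause takes PARTITION BY only, with no ORDER BY.
  PERCENTILE_DISC(tip_amount, 0.5) OVER (PARTITION BY month) AS median
FROM (
  SELECT EXTRACT(MONTH FROM pickup_datetime) AS month, tip_amount
  FROM `nyc-tlc.green.trips_2015`
  WHERE tip_amount > 0
)
ORDER BY month
```

PERCENTILE_DISC returns a value that actually occurs in the data, whereas PERCENTILE_CONT interpolates between neighbouring values, so the two can differ when a month has an even number of rows.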


Source: https://stackoverflow.com/questions/29092758/how-to-calculate-median-of-a-numeric-sequence-in-google-bigquery-efficiently
