How to calculate median of a numeric sequence in Google BigQuery efficiently?

Question

I need to calculate the median of a numeric sequence in Google BigQuery efficiently. Is that possible?

Answer 1:

Yes, it's possible with the PERCENTILE_CONT window function. From the documentation:

Returns values that are based upon linear interpolation between the values of the group, after ordering them per the ORDER BY clause.

The percentile argument must be between 0 and 1.

This window function requires ORDER BY in the OVER clause.

An example query looks like this (the max() is only there to satisfy the GROUP BY and collapse each room to one row; it is not doing any real math, so don't let it confuse you):

SELECT room,
       max(median)
FROM
  (SELECT room,
          percentile_cont(0.5) OVER (PARTITION BY room
                                     ORDER BY temperature) AS median
   FROM
     (SELECT 1 AS room,
             11 AS temperature),
     (SELECT 1 AS room,
             12 AS temperature),
     (SELECT 1 AS room,
             14 AS temperature),
     (SELECT 1 AS room,
             19 AS temperature),
     (SELECT 1 AS room,
             13 AS temperature),
     (SELECT 2 AS room,
             20 AS temperature),
     (SELECT 2 AS room,
             21 AS temperature),
     (SELECT 2 AS room,
             29 AS temperature),
     (SELECT 3 AS room,
             30 AS temperature))
GROUP BY room

This returns:

+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+
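
To make the "linear interpolation" part concrete, here is a minimal Python sketch of an interpolated percentile applied to the same room data. The helper name percentile_cont is just borrowed for illustration, and the position formula (fraction times n minus 1) is the common textbook definition that I am assuming here, not something taken from BigQuery's implementation.

```python
def percentile_cont(values, fraction):
    """Linearly interpolated percentile (illustrative helper, not a BigQuery API)."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # Fractional position of the requested percentile in the sorted list
    # (assumed definition: fraction * (n - 1)).
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    # Interpolate between the two neighbouring values.
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# Same sample data as the query above, keyed by room.
rooms = {1: [11, 12, 14, 19, 13], 2: [20, 21, 29], 3: [30]}
medians = {room: percentile_cont(temps, 0.5) for room, temps in rooms.items()}
print(medians)  # {1: 13.0, 2: 21.0, 3: 30.0}
```

With an even number of values the interpolation becomes visible: percentile_cont([20, 21], 0.5) gives 20.5, a value that does not occur in the input.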

Answer 2:

An alternative, when you don't need exact results and an approximation is fine: combine the NTH and QUANTILES aggregation functions. The advantage of this method is that it scales much better than analytic window functions; the disadvantage is that the results are approximate.

SELECT room,
       NTH(51, QUANTILES(temperature, 101))
FROM
  (SELECT 1 AS room,
          11 AS temperature),
  (SELECT 1 AS room,
          12 AS temperature),
  (SELECT 1 AS room,
          14 AS temperature),
  (SELECT 1 AS room,
          19 AS temperature),
  (SELECT 1 AS room,
          13 AS temperature),
  (SELECT 2 AS room,
          20 AS temperature),
  (SELECT 2 AS room,
          21 AS temperature),
  (SELECT 2 AS room,
          29 AS temperature),
  (SELECT 3 AS room,
          30 AS temperature)
GROUP BY room

This returns:

+------+-------------+
| room | temperature |
+------+-------------+
|    1 |          13 |
|    2 |          21 |
|    3 |          30 |
+------+-------------+
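
The index arithmetic behind NTH over QUANTILES can be sketched in Python. This is only an illustration under assumptions: the quantiles and nth helpers below are hypothetical stand-ins that compute exact boundaries on a small list, whereas BigQuery's QUANTILES is approximate. The point is that with 101 boundaries (minimum first, maximum last) and NTH counting from 1, the 51st boundary is the 50th percentile.

```python
def quantiles(values, buckets):
    """Return `buckets` boundary values, minimum first and maximum last.

    Exact stand-in for illustration; BigQuery's QUANTILES is approximate.
    """
    ordered = sorted(values)
    last = len(ordered) - 1
    # Boundary i sits i / (buckets - 1) of the way through the sorted data.
    return [ordered[round(i * last / (buckets - 1))] for i in range(buckets)]


def nth(n, items):
    """1-based lookup, mirroring the way NTH starts counting at 1."""
    return items[n - 1]


rooms = {1: [11, 12, 14, 19, 13], 2: [20, 21, 29], 3: [30]}
medians = {room: nth(51, quantiles(temps, 101)) for room, temps in rooms.items()}
print(medians)  # {1: 13, 2: 21, 3: 30}
```

Unlike an interpolated percentile, this always returns a value that actually occurs in the data.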

Answer 3:

2018 update with more metrics:

BigQuery SQL: Average, geometric mean, remove outliers, median

For my own reference, working queries with taxi data:

Approximate quantiles:

SELECT MONTH(pickup_datetime) month, NTH(51, QUANTILES(tip_amount,101)) median
FROM [nyc-tlc:green.trips_2015]
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1

Gives the same results as PERCENTILE_DISC:

SELECT month, FIRST(median) median
FROM (
  SELECT MONTH(pickup_datetime) month, tip_amount, PERCENTILE_DISC(0.5) OVER(PARTITION BY month ORDER BY tip_amount) median
  FROM [nyc-tlc:green.trips_2015]
  WHERE tip_amount > 0
)
GROUP BY 1
ORDER BY 1

Standard SQL:

#StandardSQL
SELECT DATE_TRUNC(DATE(pickup_datetime), MONTH) month, APPROX_QUANTILES(tip_amount,1000)[OFFSET(500)] median
FROM `nyc-tlc.green.trips_2015`
WHERE tip_amount > 0
GROUP BY 1
ORDER BY 1
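
Two details from the queries above, sketched in Python under stated assumptions. First, APPROX_QUANTILES(x, 1000) returns 1001 boundaries (minimum through maximum), and OFFSET is zero-based, so offset 500 is the middle one. Second, a discrete percentile picks an actual row value instead of interpolating; the percentile_disc helper below is a hypothetical illustration using the usual "first value whose cumulative share reaches the fraction" definition, and the tips list is made-up sample data.

```python
import math


def percentile_disc(values, fraction):
    """Smallest actual value whose cumulative share of rows is >= fraction.

    Illustrative helper only, not a BigQuery API.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    # 1-based rank of the first row whose cumulative distribution reaches the fraction.
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


# Offset arithmetic for APPROX_QUANTILES(x, 1000)[OFFSET(500)].
number = 1000
boundary_count = number + 1  # boundaries include both the minimum and the maximum
median_offset = number // 2  # zero-based index of the middle boundary
print(boundary_count, median_offset)  # 1001 500

tips = [1.0, 2.0, 2.5, 4.0]  # made-up sample values
print(percentile_disc(tips, 0.5))  # 2.0
```

For the even-sized sample, an interpolated median would be 2.25, while the discrete version returns 2.0, a value present in the data.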

Source: https://stackoverflow.com/questions/29092758/how-to-calculate-median-of-a-numeric-sequence-in-google-bigquery-efficiently

Tags

google-bigquery

median