Question:
Most databases have a built-in function for calculating the median, but I don't see anything for median in Amazon Redshift.
You could calculate the median using a combination of the nth_value() and count() analytic functions, but that seems janky. I would be very surprised if an analytics db didn't have a built-in method for computing the median, so I'm assuming I'm missing something.
http://docs.aws.amazon.com/redshift/latest/dg/r_Examples_of_NTH_WF.html
http://docs.aws.amazon.com/redshift/latest/dg/c_Window_functions.html
Answer 1:
As of 2014-10-17, Redshift supports the MEDIAN window function:
    # select min(median) from (select median(num) over () from temp);

     min
    -----
     4.0
Answer 2:
Try the NTILE function.
You would divide your data into 2 ranked groups and pick the minimum value from the first group. That's because in datasets with an odd number of values, the first ntile will have 1 more value than the second. This approximation should work very well for large datasets.
    create table temp (num smallint);
    insert into temp values (1),(5),(10),(2),(4);

    select num, ntile(2) over(order by num desc) from temp;

     num | ntile
    -----+-------
      10 |     1
       5 |     1
       4 |     1
       2 |     2
       1 |     2

    select min(num) as median
    from (select num, ntile(2) over(order by num desc) from temp)
    where ntile = 1;

     median
    --------
          4
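To see why this is only an approximation, here is a minimal Python sketch of the same logic, written outside SQL so it can be run without a cluster. The function name `ntile_median` is hypothetical, and the bucket sizing assumes the usual NTILE behaviour in which the first bucket receives the extra row when the count is odd.

```python
def ntile_median(values):
    """Approximate the median as the smallest value in the first of two
    NTILE buckets, with rows ordered descending as in the query above.
    Assumes a non-empty input."""
    ordered = sorted(values, reverse=True)
    # NTILE(2) gives the first bucket the extra row when the count is odd.
    first_bucket_size = (len(ordered) + 1) // 2
    first_bucket = ordered[:first_bucket_size]
    return min(first_bucket)


print(ntile_median([1, 5, 10, 2, 4]))     # 4
print(ntile_median([1, 5, 10, 2, 4, 9]))  # 5
```

With an odd number of rows the result is the true median. With an even number it is the upper of the two middle values (5 here, where an interpolated median would be 4.5), which is why the approach is best suited to large datasets where that gap is negligible.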
Answer 3:
I had difficulty with this also, but got some help from Amazon. Since the 2014-06-30 version of Redshift, you can do this with the PERCENTILE_CONT or PERCENTILE_DISC window functions.
They're slightly weird to use, as they will tack the median (or whatever percentile you choose) onto every row. You put that in a subquery and then take the MIN (or whatever) of the median column.
    # select count(num), min(median) as median
      from (select num,
                   percentile_cont (0.5) within group (order by num) over () as median
            from temp);

     count | median
    -------+--------
         5 |    4.0
(The reason it's complicated is that window functions can also do their own mini-group-by and ordering to give you the median of many groups all at once, and other tricks.)
In the case of an even number of values, CONT(inuous) will interpolate between the two middle values, whereas DISC(rete) will pick one of them.
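The difference between the two can be illustrated with a small Python sketch. This is not Redshift's implementation. It follows the standard SQL definitions as I understand them: the continuous version interpolates linearly at position p * (n - 1) in the sorted values, and the discrete version returns the first value whose cumulative distribution reaches p. The function names simply mirror the SQL ones.

```python
import math


def percentile_cont(values, p):
    """Linear interpolation between the two rows nearest the target position."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)  # fractional 0-based position
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def percentile_disc(values, p):
    """First value whose cumulative distribution (k / n) is at least p."""
    ordered = sorted(values)
    k = max(1, math.ceil(p * len(ordered)))  # 1-based row number
    return ordered[k - 1]


even = [1, 5, 10, 2, 4, 9]
print(percentile_cont(even, 0.5))  # 4.5
print(percentile_disc(even, 0.5))  # 4
```

Under these definitions the discrete version returns the lower of the two middle values for an even count, and for an odd count both functions agree on the middle row.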
Answer 4:
I typically use the NTILE function to split the data into two groups if I’m looking for an answer that’s close enough. However, if I want the exact median (e.g. the midpoint of an even set of rows), I use a technique suggested on the AWS Redshift Discussion Forum.
This technique orders the rows in both ascending and descending order, then if there is an odd number of rows, it returns the average of the middle row (that is, where row_num_asc = row_num_desc), which is simply the middle row itself.
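    CREATE TABLE temp (num SMALLINT);
    INSERT INTO temp VALUES (1),(5),(10),(2),(4);

    -- Redshift requires an explicit frame clause when an aggregate
    -- window function uses ORDER BY, hence ROWS UNBOUNDED PRECEDING.
    SELECT AVG(num) AS median
    FROM (SELECT num,
                 SUM(1) OVER (ORDER BY num ASC ROWS UNBOUNDED PRECEDING) AS row_num_asc,
                 SUM(1) OVER (ORDER BY num DESC ROWS UNBOUNDED PRECEDING) AS row_num_desc
          FROM temp) AS ordered
    WHERE row_num_asc IN (row_num_desc, row_num_desc - 1, row_num_desc + 1);

     median
    --------
          4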
If there is an even number of rows, it returns the average of the two middle rows.
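    INSERT INTO temp VALUES (9);

    -- Redshift requires an explicit frame clause when an aggregate
    -- window function uses ORDER BY, hence ROWS UNBOUNDED PRECEDING.
    SELECT AVG(num) AS median
    FROM (SELECT num,
                 SUM(1) OVER (ORDER BY num ASC ROWS UNBOUNDED PRECEDING) AS row_num_asc,
                 SUM(1) OVER (ORDER BY num DESC ROWS UNBOUNDED PRECEDING) AS row_num_desc
          FROM temp) AS ordered
    WHERE row_num_asc IN (row_num_desc, row_num_desc - 1, row_num_desc + 1);

     median
    --------
        4.5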
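For readers who want to check the row-matching condition by hand, here is a rough Python sketch of the same idea. The function name `median_by_row_numbers` is hypothetical, and the sketch assumes each row gets a strict position in both orderings, so the ascending and descending numbers always sum to n + 1.

```python
def median_by_row_numbers(values):
    """Average the rows whose ascending and descending positions
    are equal or differ by exactly one. Assumes a non-empty input."""
    ordered = sorted(values)
    n = len(ordered)
    middle = []
    for i, v in enumerate(ordered):
        row_num_asc = i + 1   # 1-based position counting up
        row_num_desc = n - i  # 1-based position counting down
        if row_num_asc in (row_num_desc, row_num_desc - 1, row_num_desc + 1):
            middle.append(v)
    return sum(middle) / len(middle)


print(median_by_row_numbers([1, 5, 10, 2, 4]))     # 4.0
print(median_by_row_numbers([1, 5, 10, 2, 4, 9]))  # 4.5
```

For an odd count only the row where the two positions are equal qualifies. For an even count the positions can never be equal (their sum n + 1 is odd), so exactly the two middle rows match through the plus-or-minus-one cases and get averaged.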