Count median grouped by day

I have a query that computes the median value over all rows of a table:

SELECT avg(t1.price) as median_val FROM (
SELECT @rownum:=@rownum+1 as `row_number`, d.price
  FROM mediana d,  (SELECT @rownum:=0) r
  WHERE 1
  ORDER BY d.price
) as t1, 
(
  SELECT count(*) as total_rows
  FROM mediana d
  WHERE 1
) as t2
WHERE t1.row_number>=total_rows/2 and t1.row_number<=total_rows/2+1;

Now I need the median not over the whole table, but grouped by date. Is that possible? For the data in http://sqlfiddle.com/#!2/7cf27 the result should be 2013-03-06 - 1.5, 2013-03-05 - 3.5.

I hope I didn't lose myself and overcomplicate things, but here's what I came up with:

SELECT sq.created_at, avg(sq.price) as median_val FROM (
SELECT t1.row_number, t1.price, t1.created_at FROM(
SELECT IF(@prev!=d.created_at, @rownum:=1, @rownum:=@rownum+1) as `row_number`, d.price, @prev:=d.created_at AS created_at
FROM mediana d, (SELECT @rownum:=0, @prev:=NULL) r
ORDER BY created_at, price
) as t1 INNER JOIN  
(
  SELECT count(*) as total_rows, created_at 
  FROM mediana d
  GROUP BY created_at
) as t2
ON t1.created_at = t2.created_at
WHERE 1=1
AND t1.row_number>=t2.total_rows/2 and t1.row_number<=t2.total_rows/2+1
)sq
group by sq.created_at

What I did here is mainly to reset the row number to 1 whenever the date changes (which is why it's important to order by created_at first), and to include the date in the output so we can group by it. The subquery that calculates the total rows is also grouped by created_at, so the two subqueries can be joined on it.
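For comparison, on MySQL 8.0 or later the same per-date numbering can be sketched with window functions instead of user variables. This is a swapped-in technique, not the query above. It assumes the same mediana table with created_at and price columns; the aliases rn, total_rows and ranked are made up for illustration.

```sql
-- Sketch for MySQL 8.0+ only: number the prices within each date and
-- keep the middle row (odd count) or the two middle rows (even count).
SELECT created_at, AVG(price) AS median_val
FROM (
    SELECT
        created_at,
        price,
        ROW_NUMBER() OVER (PARTITION BY created_at ORDER BY price) AS rn,
        COUNT(*)     OVER (PARTITION BY created_at)                AS total_rows
    FROM mediana
) AS ranked
WHERE rn IN (FLOOR((total_rows + 1) / 2), FLOOR((total_rows + 2) / 2))
GROUP BY created_at;
```

FLOOR((total_rows + 1) / 2) and FLOOR((total_rows + 2) / 2) name the same row when the count is odd and the two middle rows when it is even, so AVG covers both cases.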

Here is another take on the median, inspired by this post, using SUBSTRING_INDEX and GROUP_CONCAT. I am not sure how it performs on large tables compared with the row-number method described by @fancyPants, but on smaller tables (~20K rows) it runs very quickly.

SET SESSION group_concat_max_len = 1000000;
SELECT
    created_at,
    (
    CAST(
        SUBSTRING_INDEX(
        SUBSTRING_INDEX(
        GROUP_CONCAT(
            price ORDER BY price SEPARATOR ','),
            ',', FLOOR((COUNT(*)+1)/2) ), ',', -1) AS DECIMAL) +
    CAST(
        SUBSTRING_INDEX(
        SUBSTRING_INDEX(
        GROUP_CONCAT(
            price ORDER BY price SEPARATOR ','),
            ',', FLOOR((COUNT(*)+2)/2) ), ',', -1) AS DECIMAL)
    ) / 2.0 AS median_price
FROM
    mediana
GROUP BY
    created_at
;

Here is the output for the sqlfiddle given in the question (the fiddle appears to be broken, but I ran this on the table shown in the fiddle within MySQL itself):

+------------+--------------+
| created_at | median_price |
+------------+--------------+
| 2012-03-05 |       3.5000 |
| 2012-03-06 |       1.5000 |
+------------+--------------+

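GROUP_CONCAT essentially creates a string representation of an array of prices per created_at date, sorted by price. Each pair of nested SUBSTRING_INDEX calls then picks out one middle value: the inner call keeps the list up to the requested position, and the outer call takes the last element of what is left. Two GROUP_CONCAT expressions are needed, and their results are averaged, to handle the case in which there is an even number of price elements for a single created_at date; for an odd number, both expressions return the same element.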
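To see the position arithmetic in isolation, here is a small sketch that runs the same nested calls on the literal list '1,2,3,4' (a made-up list of four prices, so COUNT(*) is written as 4). The column aliases are illustrative only.

```sql
SELECT
    -- Inner call alone: everything before the 2nd comma.
    SUBSTRING_INDEX('1,2,3,4', ',', 2) AS first_two,       -- '1,2'
    -- FLOOR((4 + 1) / 2) = 2: keep '1,2', then take the last element.
    SUBSTRING_INDEX(
        SUBSTRING_INDEX('1,2,3,4', ',', FLOOR((4 + 1) / 2)),
        ',', -1) AS lower_middle,                           -- '2'
    -- FLOOR((4 + 2) / 2) = 3: keep '1,2,3', then take the last element.
    SUBSTRING_INDEX(
        SUBSTRING_INDEX('1,2,3,4', ',', FLOOR((4 + 2) / 2)),
        ',', -1) AS upper_middle;                           -- '3'
```

One caveat worth keeping in mind: CAST(... AS DECIMAL) without a precision means DECIMAL(10,0), so this approach fits whole-number prices like those in the output above. For fractional prices a cast with an explicit scale, for example DECIMAL(20,4), would presumably be needed; that type is an assumption, not something the original answer specifies.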

UPDATE:

It is worth mentioning that the result of GROUP_CONCAT is limited to a default maximum length of 1024 bytes, see here. Longer results are truncated, which leads to a miscalculated median. If you are concerned about large results, you can raise the limit for the current session with the command SET SESSION group_concat_max_len = N; where N is some larger value. I have added that setting to the code snippet above. I chose 1000000, but you could use another value as well.
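A rough way to check whether the limit matters for your data is to compare it with the length of each concatenated list. This is only a sketch against the same mediana table; the aliases max_len, n_prices and concat_len are made up for illustration.

```sql
-- Current limit for this session (1024 by default).
SELECT @@SESSION.group_concat_max_len AS max_len;

-- Length in bytes of each per-date price list.
-- A concat_len that reaches max_len suggests the list was truncated.
SELECT
    created_at,
    COUNT(price) AS n_prices,
    LENGTH(GROUP_CONCAT(price ORDER BY price SEPARATOR ',')) AS concat_len
FROM mediana
GROUP BY created_at;
```

If any concat_len comes close to max_len, raise the limit before trusting the medians for that date.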

You can also spot-check your results using COUNT(*) and OFFSET for one of your GROUP BY values. For example:

1. Get the number of rows for a specific GROUP BY value:

SELECT COUNT(*) FROM mediana WHERE created_at = '2012-03-06';

2. Let X be the number of rows returned in step 1. Divide X by 2 to get Y.
3. Use Y as an offset to find the median.

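a. If Y is a whole number, run both of the following queries, substituting the literal values of Y-1 and Y (MySQL expects a constant after OFFSET, not an expression):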

SELECT price FROM mediana WHERE created_at = '2012-03-06' ORDER BY price LIMIT 1 OFFSET (Y-1);

and

SELECT price FROM mediana WHERE created_at = '2012-03-06' ORDER BY price LIMIT 1 OFFSET Y;

and average the two results to get the median value.

b. If Y is not a whole number, round it down to the nearest whole number (call it W) and use that as a single offset,

SELECT price FROM mediana WHERE created_at = '2012-03-06' ORDER BY price LIMIT 1 OFFSET W;

and this will be your median value.
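As a concrete sketch of the even case, assume the count for '2012-03-06' came back as X = 4 (a made-up count, not taken from the fiddle). Then Y = 2, and the two middle rows sit at offsets 1 and 2, so one derived table can fetch both and average them. The aliases middle and median_check are illustrative.

```sql
-- Assumed: X = 4, so Y = 2; the middle rows are at offsets Y-1 = 1 and Y = 2.
SELECT AVG(price) AS median_check
FROM (
    SELECT price
    FROM mediana
    WHERE created_at = '2012-03-06'
    ORDER BY price
    LIMIT 2 OFFSET 1   -- two rows, starting at the second-smallest price
) AS middle;
```

For an odd count such as X = 5, Y is 2.5 and W is 2, so LIMIT 1 OFFSET 2 returns the median directly and no averaging is needed.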

Source: https://stackoverflow.com/questions/15386799/count-median-grouped-by-day

Tags: mysql, group-by, median