Count median grouped by day

Submitted by 前提是你 on 2019-12-05 07:03:06

I hope I didn't lose myself and overcomplicate things, but here's what I came up with:

SELECT sq.created_at, AVG(sq.price) AS median_val
FROM (
  -- keep only the middle row(s) of each day
  SELECT t1.`row_number`, t1.price, t1.created_at
  FROM (
    -- number the rows within each day, restarting at 1 when the date changes
    SELECT IF(@prev != d.created_at, @rownum := 1, @rownum := @rownum + 1) AS `row_number`,
           d.price,
           @prev := d.created_at AS created_at
    FROM mediana d, (SELECT @rownum := 0, @prev := NULL) r
    ORDER BY created_at, price
  ) AS t1
  INNER JOIN (
    -- number of rows per day
    SELECT COUNT(*) AS total_rows, created_at
    FROM mediana d
    GROUP BY created_at
  ) AS t2 ON t1.created_at = t2.created_at
  WHERE t1.`row_number` >= t2.total_rows / 2
    AND t1.`row_number` <= t2.total_rows / 2 + 1
) sq
GROUP BY sq.created_at;

What I did here is reset the row number to 1 whenever the date changes (which is why ordering by created_at is important) and include the date in the result so we can group by it. The subquery that counts the total rows also selects created_at, so the two subqueries can be joined on it.
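To sanity-check the row-number filter outside MySQL, here is a rough Python sketch of the same idea. The function name median_by_day and the sample rows in the usage note are made up for illustration; they are not the data from the question.

```python
from collections import defaultdict


def median_by_day(rows):
    """rows: iterable of (created_at, price) pairs; returns {created_at: median}.

    Hypothetical helper that mirrors the query's logic, not part of the answer.
    """
    groups = defaultdict(list)
    for created_at, price in rows:
        groups[created_at].append(price)

    medians = {}
    for created_at, prices in groups.items():
        prices.sort()                  # ORDER BY created_at, price
        total_rows = len(prices)       # COUNT(*) per created_at (subquery t2)
        # Keep rows whose 1-based row number lies in
        # [total_rows / 2, total_rows / 2 + 1], as in the WHERE clause.
        middle = [
            p for row_number, p in enumerate(prices, start=1)
            if total_rows / 2 <= row_number <= total_rows / 2 + 1
        ]
        medians[created_at] = sum(middle) / len(middle)   # AVG(sq.price)
    return medians
```

With made-up rows such as [("d1", 8), ("d1", 1), ("d1", 4), ("d1", 3), ("d2", 5), ("d2", 9), ("d2", 7)], the even-sized group keeps row numbers 2 and 3 and the odd-sized group keeps only row number 2.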

Here is another take on the median inspired by this post using SUBSTRING_INDEX and GROUP_CONCAT. I am not sure about the performance on large tables relative to the method described by @fancyPants that uses row numbers, but on smaller tables (~20K rows) it works very fast.

SET SESSION group_concat_max_len = 1000000;
SELECT
    created_at,
    (
    CAST(
        SUBSTRING_INDEX(
        SUBSTRING_INDEX(
        GROUP_CONCAT(
            price ORDER BY price SEPARATOR ','),
            ',', FLOOR((COUNT(*)+1)/2) ), ',', -1) AS DECIMAL) +
    CAST(
        SUBSTRING_INDEX(
        SUBSTRING_INDEX(
        GROUP_CONCAT(
            price ORDER BY price SEPARATOR ','),
            ',', FLOOR((COUNT(*)+2)/2) ), ',', -1) AS DECIMAL)
    ) / 2.0 AS median_price
FROM
    mediana
GROUP BY
    created_at
;

Here is the output for the sqlfiddle given in the question (the fiddle appears to be broken, but I ran this on the table shown in the fiddle within MySQL itself):

+------------+--------------+
| created_at | median_price |
+------------+--------------+
| 2012-03-05 |       3.5000 |
| 2012-03-06 |       1.5000 |
+------------+--------------+

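The GROUP_CONCAT essentially creates a string representation of an array of prices per created_at date. The two nested SUBSTRING_INDEX calls then pick out the middle value(s), i.e. the median. Two GROUP_CONCAT expressions are needed, and their results averaged, to handle the case in which there is an even number of prices for a single created_at date; for an odd number, both expressions return the same middle value. Note that CAST(... AS DECIMAL) without a precision and scale defaults to DECIMAL(10,0), so fractional prices would be rounded to whole numbers before averaging; specify a scale such as DECIMAL(10,2) if price has a fractional part.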
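The index arithmetic is easy to get wrong, so here is a small Python sketch that mimics it. substring_index is a hand-written stand-in for MySQL's SUBSTRING_INDEX (assuming a non-empty delimiter), and median_from_concat is a hypothetical name; floats are used here where the query uses DECIMAL.

```python
def substring_index(s, delim, count):
    """Rough imitation of MySQL SUBSTRING_INDEX for a non-empty delimiter."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])    # everything left of the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])    # everything right of the count-th from the end
    return ""


def median_from_concat(prices):
    """Median via the same string slicing as the GROUP_CONCAT query."""
    # GROUP_CONCAT(price ORDER BY price SEPARATOR ',')
    joined = ",".join(str(p) for p in sorted(prices))
    n = len(prices)                          # COUNT(*)
    # FLOOR((n+1)/2) and FLOOR((n+2)/2) are the 1-based middle positions;
    # they coincide when n is odd.
    lower = substring_index(substring_index(joined, ",", (n + 1) // 2), ",", -1)
    upper = substring_index(substring_index(joined, ",", (n + 2) // 2), ",", -1)
    return (float(lower) + float(upper)) / 2.0
```

For four prices the two positions are 2 and 3; for three prices both positions are 2, so the same value is averaged with itself.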

UPDATE:

It is worth mentioning that GROUP_CONCAT has a default maximum result length of 1024 bytes (the group_concat_max_len setting), see here. Longer results are truncated, which causes a miscalculation. You can raise the limit with the command SET SESSION group_concat_max_len = N; where N is some larger value, if you are concerned about large results. I have added that setting to the code snippet above. I chose 1000000, but you could use another value as well.
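If you want a rough idea of whether a group could hit the limit, you can estimate the length of the concatenated string. This is only a sketch: concat_length and would_truncate are hypothetical helpers, and it assumes each price is rendered the same way Python's str() renders it (MySQL may format DECIMAL columns differently) and that all characters are single-byte.

```python
def concat_length(prices, separator=","):
    """Estimated length of the comma-joined price list for one group."""
    rendered = [str(p) for p in prices]
    separators = max(len(rendered) - 1, 0) * len(separator)
    return sum(len(r) for r in rendered) + separators


def would_truncate(prices, max_len=1024):
    """True if the estimated length exceeds max_len (1024 is the MySQL default)."""
    return concat_length(prices) > max_len
```

For example, the integers 1 through 500 need 1891 characters once joined with commas, which is well over the default of 1024.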

You can also spot check your results using COUNT(*) and OFFSET with one of your GROUP BY values. For example,

  1. First get the count of the number of rows for a specific GROUP BY value,

SELECT COUNT(*) FROM mediana WHERE created_at = '2012-03-06';

  2. Let X be the number of rows you get from step 1. Divide X by 2 to get half its value, Y.

  3. Use the value Y as an offset to find the median.

    a. If Y is a whole number, run both

    SELECT price FROM mediana WHERE created_at = '2012-03-06' ORDER BY price LIMIT 1 OFFSET (Y-1);

    and

    SELECT price FROM mediana WHERE created_at = '2012-03-06' ORDER BY price LIMIT 1 OFFSET Y;

    and average the two results to get the median value.

    b. If Y is not a whole number, round Y down to the nearest whole number (call it W) and use that as a single offset,

    SELECT price FROM mediana WHERE created_at = '2012-03-06' ORDER BY price LIMIT 1 OFFSET W;

    and this will be your median value.
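The spot-check procedure above can be written out as a short Python sketch, with list indexing standing in for LIMIT 1 OFFSET. The function name spot_check_median is made up for illustration, and the input is assumed to be the prices for a single created_at value.

```python
def spot_check_median(prices):
    """Median of one group's prices using the COUNT(*) / OFFSET procedure."""
    if not prices:
        raise ValueError("need at least one price")
    ordered = sorted(prices)      # ORDER BY price
    x = len(ordered)              # COUNT(*) for the chosen created_at
    y = x / 2                     # half the row count
    if y == int(y):
        # Case (a): Y is whole, so average the rows at OFFSET Y-1 and OFFSET Y.
        y = int(y)
        return (ordered[y - 1] + ordered[y]) / 2
    # Case (b): Y is fractional, so round down to W and take the row at OFFSET W.
    w = int(y)
    return ordered[w]
```

Offsets are 0-based, just like list indexes, so OFFSET Y-1 and OFFSET Y are the two middle rows of an even-sized group.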
