Simple way to calculate median with MySQL

北荒 2020-11-22 04:20

What's the simplest (and hopefully not too slow) way to calculate the median with MySQL? I've used AVG(x) for finding the mean, but I'm having a hard time fi

30 answers
  • 2020-11-22 04:39

    Install and use these MySQL statistical functions: http://www.xarg.org/2012/07/statistical-functions-in-mysql/

    After that, calculating the median is easy:

    SELECT median(val) FROM data;
    
  • 2020-11-22 04:39

    I needed both a median and a percentile solution, so I wrote a simple and fairly flexible function based on the findings in this thread. I am always happy to find ready-made functions that are easy to include in my projects, so I decided to share it:

    function mysql_percentile($table, $column, $where, $percentile = 0.5) {
    
        $sql = "
                SELECT `t1`.`".$column."` as `percentile` FROM (
                SELECT @rownum:=@rownum+1 as `row_number`, `d`.`".$column."`
                  FROM `".$table."` `d`,  (SELECT @rownum:=0) `r`
                  ".$where."
                  ORDER BY `d`.`".$column."`
                ) as `t1`, 
                (
                  SELECT count(*) as `total_rows`
                  FROM `".$table."` `d`
                  ".$where."
                ) as `t2`
                WHERE 1
                AND `t1`.`row_number`=floor(`total_rows` * ".$percentile.")+1;
            ";
    
        $result = sql($sql, 1);
    
        if (!empty($result)) {
            return $result['percentile'];       
        } else {
            return 0;
        }
    
    }
    

    Usage is straightforward. Here is an example from my current project:

    ...
    $table = DBPRE."zip_".$slug;
    $column = 'seconds';
    $where = "WHERE `reached` = '1' AND `time` >= '".$start_time."'";
    
        $reaching['median'] = mysql_percentile($table, $column, $where, 0.5);
        $reaching['percentile25'] = mysql_percentile($table, $column, $where, 0.25);
        $reaching['percentile75'] = mysql_percentile($table, $column, $where, 0.75);
    ...
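
    To make the row arithmetic in the query easier to check, here is a minimal Python sketch of the same rule: sort the values and take the 1-based row floor(total_rows * percentile) + 1. The name percentile_by_row is hypothetical and not part of the answer's code. Returning None when no row matches is an assumption of this sketch (the PHP function returns 0 in that case).

```python
import math

def percentile_by_row(values, percentile=0.5):
    """Pick the value at 1-based row floor(n * percentile) + 1 of the sorted data.

    Hypothetical helper mirroring the WHERE clause of the SQL above.
    """
    ordered = sorted(values)
    n = len(ordered)
    row_number = math.floor(n * percentile) + 1
    # The SQL matches no row when row_number exceeds the row count
    # (for example percentile = 1.0, or an empty table).
    if row_number > n:
        return None
    return ordered[row_number - 1]

seconds = [10, 20, 30, 40, 50]
print(percentile_by_row(seconds, 0.5))   # middle row of five
print(percentile_by_row(seconds, 0.25))  # floor(1.25) + 1 = row 2
```

    Note that for an even number of rows this rule selects the upper of the two middle values rather than their average.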
    
  • 2020-11-22 04:41

    Unfortunately, neither TheJacobTaylor's nor velcrow's answers return accurate results for current versions of MySQL.

    velcrow's answer above is close, but it does not calculate correctly for result sets with an even number of rows. The median is defined as either 1) the middle number of an odd-sized set, or 2) the average of the two middle numbers of an even-sized set.

    So, here's velcrow's solution patched to handle both odd-sized and even-sized sets:

    SELECT AVG(middle_values) AS 'median' FROM (
      SELECT t1.median_column AS 'middle_values' FROM
        (
          SELECT @row:=@row+1 as `row`, x.median_column
          FROM median_table AS x, (SELECT @row:=0) AS r
          WHERE 1
          -- put some where clause here
          ORDER BY x.median_column
        ) AS t1,
        (
          SELECT COUNT(*) as 'count'
          FROM median_table x
          WHERE 1
          -- put same where clause here
        ) AS t2
        -- the following condition will return 1 record for odd number sets, or 2 records for even number sets.
        WHERE t1.row >= t2.count/2 and t1.row <= ((t2.count/2) +1)) AS t3;
    

    To use this, follow these 3 easy steps:

    1. Replace "median_table" (2 occurrences) in the above code with the name of your table
    2. Replace "median_column" (3 occurrences) with the column name you'd like to find a median for
    3. If you have a WHERE condition, replace "WHERE 1" (2 occurrences) with your where condition
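
    The final WHERE condition relies on MySQL's / operator performing decimal division, so count/2 is 2.5 for five rows. A small Python sketch of that row filter, with hypothetical names middle_rows and median, shows which rows survive for odd and even counts:

```python
def middle_rows(count):
    """1-based row numbers r satisfying count/2 <= r <= count/2 + 1."""
    half = count / 2  # true division, like MySQL's "/" (5 / 2 == 2.5)
    return [r for r in range(1, count + 1) if half <= r <= half + 1]

def median(values):
    """Average the values found at the middle row(s) of the sorted data."""
    ordered = sorted(values)
    picked = [ordered[r - 1] for r in middle_rows(len(ordered))]
    if not picked:
        return None  # empty input: AVG over no rows is NULL in SQL
    return sum(picked) / len(picked)

print(middle_rows(5))  # one middle row for an odd count
print(middle_rows(4))  # two middle rows for an even count
print(median([1, 3, 2, 4]))
```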
  • 2020-11-22 04:41

    MySQL has supported window functions since version 8.0, so you can use ROW_NUMBER. Do not use RANK or DENSE_RANK: both assign the same number to equal values (as in sports rankings), so the positions the query looks up may be missing or may point at the wrong values:

    SELECT AVG(t1.val) AS median_val
      FROM (SELECT val, 
                   ROW_NUMBER() OVER(ORDER BY val) AS row_num
              FROM data) t1,
           (SELECT COUNT(*) AS num_records FROM data) t2
     WHERE t1.row_num IN
           (FLOOR((t2.num_records + 1) / 2), 
            FLOOR((t2.num_records + 2) / 2));
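
    As a rough check of how the numbering function interacts with the two FLOOR positions, the Python sketch below emulates ROW_NUMBER, RANK and DENSE_RANK over a small list with ties. The helper names rankings and median_by_numbering are hypothetical, and the emulation is an illustration rather than MySQL's implementation.

```python
def rankings(values):
    """Return (sorted values, row_number, rank, dense_rank) as parallel lists."""
    ordered = sorted(values)
    row_number, rank, dense_rank = [], [], []
    for i, v in enumerate(ordered):
        row_number.append(i + 1)  # always unique and gap-free
        if i > 0 and v == ordered[i - 1]:
            rank.append(rank[-1])              # ties share a rank
            dense_rank.append(dense_rank[-1])  # ties share a dense rank
        else:
            rank.append(i + 1)  # leaves gaps after ties
            dense_rank.append(dense_rank[-1] + 1 if dense_rank else 1)
    return ordered, row_number, rank, dense_rank

def median_by_numbering(ordered, numbering):
    """Average the values numbered FLOOR((n+1)/2) or FLOOR((n+2)/2)."""
    n = len(ordered)
    wanted = {(n + 1) // 2, (n + 2) // 2}
    picked = [v for v, k in zip(ordered, numbering) if k in wanted]
    return sum(picked) / len(picked) if picked else None

ordered, rn, rk, dr = rankings([10, 2, 2, 1, 9, 2])
print(median_by_numbering(ordered, rn))  # true median of 1, 2, 2, 2, 9, 10
print(median_by_numbering(ordered, rk))  # positions 3 and 4 do not exist
print(median_by_numbering(ordered, dr))  # picks the wrong values
```

    With ties in the data, only the ROW_NUMBER numbering is guaranteed to contain both target positions exactly once.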
    
  • 2020-11-22 04:41

    I have a database of about 1 billion rows from which we need to determine the median age. Sorting a billion rows is hard, but if you aggregate the distinct values (ages range from 0 to 100), you can sort that short list instead and use some arithmetic to find any percentile you want, as follows:

    with rawData(count_value) as
    (
        select p.YEAR_OF_BIRTH
            from dbo.PERSON p
    ),
    overallStats (avg_value, stdev_value, min_value, max_value, total) as
    (
      select avg(1.0 * count_value) as avg_value,
        stdev(count_value) as stdev_value,
        min(count_value) as min_value,
        max(count_value) as max_value,
        count(*) as total
      from rawData
    ),
    aggData (count_value, total, accumulated) as
    (
      select count_value, 
        count(*) as total, 
            SUM(count(*)) OVER (ORDER BY count_value ROWS UNBOUNDED PRECEDING) as accumulated
      FROM rawData
      group by count_value
    )
    select o.total as count_value,
      o.min_value,
        o.max_value,
        o.avg_value,
        o.stdev_value,
        MIN(case when d.accumulated >= .50 * o.total then count_value else o.max_value end) as median_value,
        MIN(case when d.accumulated >= .10 * o.total then count_value else o.max_value end) as p10_value,
        MIN(case when d.accumulated >= .25 * o.total then count_value else o.max_value end) as p25_value,
        MIN(case when d.accumulated >= .75 * o.total then count_value else o.max_value end) as p75_value,
        MIN(case when d.accumulated >= .90 * o.total then count_value else o.max_value end) as p90_value
    from aggData d
    cross apply overallStats o
    GROUP BY o.total, o.min_value, o.max_value, o.avg_value, o.stdev_value
    ;
    

    This query depends on your database supporting window functions (including ROWS UNBOUNDED PRECEDING). If it does not, it is a simple matter to join the aggData CTE with itself and sum all prior totals into the 'accumulated' column, which is used to determine which value contains the specified percentile. The above sample calculates p10, p25, p50 (median), p75, and p90. Note that the query as written uses SQL Server syntax (the dbo schema, stdev, cross apply), so it needs adapting before it will run on MySQL.
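
    The idea of working from counts per distinct value can be sketched in Python as follows. This is an illustration of the cumulative-count rule only, with a hypothetical function name percentile_from_counts and made-up sample ages. It is not a translation of the full query.

```python
from collections import Counter
from itertools import accumulate

def percentile_from_counts(values, fraction):
    """Smallest distinct value whose cumulative count reaches fraction * total."""
    counts = Counter(values)  # one entry per distinct value
    distinct = sorted(counts)  # sorting ~100 ages, not a billion rows
    if not distinct:
        return None
    total = sum(counts.values())
    running = accumulate(counts[v] for v in distinct)  # the 'accumulated' column
    for value, accumulated in zip(distinct, running):
        if accumulated >= fraction * total:
            return value
    return distinct[-1]  # fallback, like the "else o.max_value" branch

# Made-up sample: three people aged 30, four aged 40, three aged 50.
ages = [30] * 3 + [40] * 4 + [50] * 3
for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(p, percentile_from_counts(ages, p))
```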

    -Chris

  • 2020-11-22 04:42

    If you know the exact row count, you can use this query:

    SELECT <value> AS VAL FROM <table> ORDER BY VAL LIMIT 1 OFFSET <half>
    

    where <half> = ceiling(<size> / 2.0) - 1. For an even row count this returns the lower of the two middle values rather than their average.
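
    The offset formula is easy to check outside the database. A minimal Python sketch, with hypothetical names median_offset and value_at_offset, where indexing into the sorted list stands in for ORDER BY ... LIMIT 1 OFFSET:

```python
import math

def median_offset(size):
    """Zero-based OFFSET of the median row: ceiling(size / 2.0) - 1."""
    return math.ceil(size / 2.0) - 1

def value_at_offset(values, offset):
    """Stand-in for ORDER BY val LIMIT 1 OFFSET offset."""
    return sorted(values)[offset]

odd = [7, 1, 5, 3, 9]
even = [1, 3, 5, 7]
print(value_at_offset(odd, median_offset(len(odd))))    # the middle value
print(value_at_offset(even, median_offset(len(even))))  # the lower middle value
```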
