Function to Calculate Median in SQL Server

孤独总比滥情好 2020-11-22 04:03

According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (u

30 Answers
  • 2020-11-22 04:31

    2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. SQL Server releases since then (especially SQL Server 2012) have also introduced new T-SQL features that can be used to calculate medians, and the query optimizer has improved in ways that may affect the performance of the various median solutions. Net-net, my original 2009 post is still OK, but there may be better solutions for modern SQL Server apps. Take a look at this article from 2012, which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median

    This article found the following pattern to be much, much faster than all other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest (PERCENTILE_CONT) solution tested. Note that this trick requires two separate queries which may not be practical in all cases. It also requires SQL 2012 or later.

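    -- Row count; decides whether one or two middle rows are needed.
    DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);
    
    -- Skip to the middle of the sorted rows, then average one row (odd count)
    -- or two rows (even count). 1.0 * val avoids integer averaging when val is an integer type.
    SELECT AVG(1.0 * val)
    FROM (
        SELECT val FROM dbo.EvenRows
        ORDER BY val
        OFFSET (@c - 1) / 2 ROWS
        FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
    ) AS x;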
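
    To see how the OFFSET and FETCH arithmetic picks the middle rows, here is a minimal sketch against a table variable. The name @t and its four values are made up for illustration and do not come from the article; the query is the same pattern as above.

    ```sql
    -- Hypothetical sample data (illustration only).
    DECLARE @t TABLE (val INT NOT NULL);
    INSERT INTO @t (val) VALUES (10), (20), (30), (40);

    DECLARE @c BIGINT = (SELECT COUNT(*) FROM @t);

    -- Even count (4): OFFSET (4 - 1) / 2 = 1 row, FETCH 1 + (1 - 0) = 2 rows,
    -- so the 2nd and 3rd rows (20 and 30) are averaged, giving 25.
    SELECT AVG(1.0 * val) AS median
    FROM (
        SELECT val FROM @t
        ORDER BY val
        OFFSET (@c - 1) / 2 ROWS
        FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
    ) AS x;
    ```

    With an odd count such as 5, the same expressions give OFFSET 2 and FETCH 1, so only the third row is returned and averaged with itself.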
    

    Of course, great results from one test on one schema in 2012 are no guarantee, so your mileage may vary, especially if you're on SQL Server 2014 or later. If perf is important for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure that you've found the best one for your schema.

    I'd also be especially careful using the (new in SQL Server 2012) function PERCENTILE_CONT that's recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has been improved in the 7 years since, but personally I wouldn't use this function on a large table until I verified its performance vs. other solutions.
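
    For reference, a minimal sketch of what the PERCENTILE_CONT approach looks like. The table variable @t and its values are hypothetical. PERCENTILE_CONT is an analytic function rather than an aggregate, so it returns the same value on every row, and DISTINCT is used here to collapse the result to a single row.

    ```sql
    -- Hypothetical sample data (illustration only).
    DECLARE @t TABLE (val INT NOT NULL);
    INSERT INTO @t (val) VALUES (10), (20), (30), (40);

    -- 0.5 is the 50th percentile, i.e. the median.
    -- For this data the result is 25, interpolated between 20 and 30.
    SELECT DISTINCT
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) OVER () AS median
    FROM @t;
    ```

    To get one median per group, put a PARTITION BY inside the OVER clause and select the grouping column alongside it.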

    ORIGINAL 2009 POST IS BELOW:

    There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from the article "Medians, ROW_NUMBERs, and performance". It does especially well on the actual I/Os generated during execution: it looks more costly than other solutions, but it is actually much faster.

    That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.

    As with all database performance scenarios, always try to test a solution out with real data on real hardware – you never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally-speedy solution slower.

    SELECT
       CustomerId,
       AVG(TotalDue)
    FROM
    (
       SELECT
          CustomerId,
          TotalDue,
          -- SalesOrderId in the ORDER BY is a disambiguator to break ties
          ROW_NUMBER() OVER (
             PARTITION BY CustomerId
             ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
          ROW_NUMBER() OVER (
             PARTITION BY CustomerId
             ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
       FROM Sales.SalesOrderHeader SOH
    ) x
    WHERE
       RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
    GROUP BY CustomerId
    ORDER BY CustomerId;
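
    A small self-contained sketch of the same query shape, using a hypothetical table variable @orders in place of Sales.SalesOrderHeader (the rows are invented for illustration). Because the tie-breaker makes the ordering unique, RowAsc + RowDesc always equals the partition's row count plus one, so the WHERE clause keeps exactly the middle row for an odd count and the two middle rows for an even count.

    ```sql
    -- Hypothetical sample data (illustration only).
    DECLARE @orders TABLE (
        SalesOrderId INT PRIMARY KEY,
        CustomerId   INT NOT NULL,
        TotalDue     MONEY NOT NULL
    );
    INSERT INTO @orders (SalesOrderId, CustomerId, TotalDue)
    VALUES (1, 1, 10), (2, 1, 20), (3, 1, 20),             -- customer 1: 3 orders
           (4, 2, 5),  (5, 2, 15), (6, 2, 35), (7, 2, 45); -- customer 2: 4 orders

    SELECT CustomerId, AVG(TotalDue) AS MedianTotalDue
    FROM (
        SELECT CustomerId, TotalDue,
               ROW_NUMBER() OVER (PARTITION BY CustomerId
                                  ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
               ROW_NUMBER() OVER (PARTITION BY CustomerId
                                  ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
        FROM @orders
    ) AS x
    WHERE RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
    GROUP BY CustomerId
    ORDER BY CustomerId;
    -- Customer 1: middle row is 20. Customer 2: average of 15 and 35 is 25.
    ```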
    
  • 2020-11-22 04:31

    Although Justin Grant's solution appears solid, I found that when a partition key contains many duplicate values, the ascending row numbers for the duplicates end up out of sequence, so they do not line up with the descending row numbers.

    Here is a fragment from my result:

    KEY  VALUE  ROWA  ROWD
    13   2      22    182
    13   1      6     183
    13   1      7     184
    13   1      8     185
    13   1      9     186
    13   1      10    187
    13   1      11    188
    13   1      12    189
    13   0      1     190
    13   0      2     191
    13   0      3     192
    13   0      4     193
    13   0      5     194
    

    I used Justin's code as the basis for this solution. Although it is not as efficient, given the use of multiple derived tables, it does resolve the row-ordering problem I encountered. Any improvements would be welcome, as I am not that experienced in T-SQL.

    SELECT PKEY, CAST(AVG(VALUE) AS decimal(5,2)) AS MEDIANVALUE
    FROM
    (
      SELECT PKEY, VALUE, ROWA, ROWD,
      -- FLAG marks the middle row (odd count) or the two middle rows (even count)
      CASE WHEN ROWA IN (ROWD, ROWD - 1, ROWD + 1) THEN 1 ELSE 0 END AS FLAG
      FROM
      (
        SELECT
        PKEY,
        CAST(VALUE AS decimal(5,2)) AS VALUE,
        ROWA,
        -- ROWD is derived from ROWA, so the two numberings always mirror each other
        ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) AS ROWD
        FROM
        (
          SELECT
          PKEY,
          VALUE,
          ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC, PKEY ASC) AS ROWA
          FROM [MTEST]
        ) T1
      ) T2
    ) T3
    WHERE FLAG = 1
    GROUP BY PKEY
    ORDER BY PKEY
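
    As a possible simplification (a sketch, not the answer author's code), the descending numbering can be dropped entirely by comparing the ascending row number against the partition's row count. The table variable @mtest and its rows are hypothetical stand-ins for [MTEST]. Duplicate values cause no trouble here, because whichever duplicate lands on a middle position carries the same VALUE.

    ```sql
    -- Hypothetical sample data with duplicates (illustration only).
    DECLARE @mtest TABLE (PKEY INT NOT NULL, VALUE DECIMAL(5,2) NOT NULL);
    INSERT INTO @mtest (PKEY, VALUE)
    VALUES (13, 0), (13, 0), (13, 1), (13, 1), (13, 2),  -- 5 rows, median 1
           (14, 3), (14, 3), (14, 7), (14, 9);           -- 4 rows, median (3 + 7) / 2 = 5

    SELECT PKEY, AVG(VALUE) AS MEDIANVALUE
    FROM (
        SELECT PKEY, VALUE,
               ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE) AS ROWA,
               COUNT(*)     OVER (PARTITION BY PKEY)                AS N
        FROM @mtest
    ) AS t
    -- Integer division: for odd N both expressions give the middle position;
    -- for even N they give the two middle positions.
    WHERE ROWA IN ((N + 1) / 2, (N + 2) / 2)
    GROUP BY PKEY
    ORDER BY PKEY;
    ```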
    
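  • This is as simple an answer as I could come up with. It worked well with my data. If you want to exclude certain values, just add a WHERE clause to the inner SELECT. Note that for an even number of rows it returns the lower of the two middle values rather than their average.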

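    SELECT TOP 1
        ValueField AS MedianValue
    FROM
        -- (COUNT + 1) / 2 rows, so that an odd row count lands on the middle row
        -- (COUNT / 2 would stop one row short, and return nothing for a single row)
        (SELECT TOP (SELECT (COUNT(1) + 1) / 2 FROM tTABLE)
            ValueField
        FROM
            tTABLE
        ORDER BY
            ValueField) A
    ORDER BY
        ValueField DESC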
    
  • 2020-11-22 04:32

    My original quick answer was:

    select  max(my_column) as [my_column], quartile
    from    (select my_column, ntile(4) over (order by my_column) as [quartile]
             from   my_table) i
    --where quartile = 2
    group by quartile
    

    This will give you the median and interquartile range in one fell swoop. If you really only want one row that is the median then uncomment the where clause.
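
    To make the bucketing concrete, here is a minimal sketch of how NTILE(4) assigns rows. The table variable @t and its eight values are hypothetical. With an even number of rows, the quick query above returns one of the two middle values (4 here) rather than their average (4.5).

    ```sql
    -- Hypothetical sample data (illustration only).
    DECLARE @t TABLE (my_column INT NOT NULL);
    INSERT INTO @t (my_column) VALUES (1), (2), (3), (4), (5), (6), (7), (8);

    -- NTILE(4) splits the 8 sorted rows into 4 buckets of 2 rows each:
    -- quartile 1: 1, 2 | quartile 2: 3, 4 | quartile 3: 5, 6 | quartile 4: 7, 8
    SELECT my_column,
           NTILE(4) OVER (ORDER BY my_column) AS quartile
    FROM @t
    ORDER BY my_column;
    -- MAX(my_column) for quartile 2 is therefore 4.
    ```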

    When you look at the execution plan for that query, 60% of the work is sorting the data, which is unavoidable when calculating position-dependent statistics like this.

    I've amended the answer to follow the excellent suggestion from Robert Ševčík-Robajz in the comments below:

    ;with PartitionedData as
      (select my_column, ntile(10) over (order by my_column) as [percentile]
       from   my_table),
    MinimaAndMaxima as
      (select  min(my_column) as [low], max(my_column) as [high], percentile
       from    PartitionedData
       group by percentile)
    select
      case
        when b.percentile = 10 then cast(b.high as decimal(18,2))
        else cast((a.low + b.high)  as decimal(18,2)) / 2
      end as [value], --b.high, a.low,
      b.percentile
    from    MinimaAndMaxima a
      join  MinimaAndMaxima b on (a.percentile -1 = b.percentile) or (a.percentile = 10 and b.percentile = 10)
    --where b.percentile = 5
    

    This should calculate the correct median and percentile values when you have an even number of data items. Again, uncomment the final where clause if you only want the median and not the entire percentile distribution.

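  • 2020-11-22 04:32

    DECLARE @Obs int
    -- The IDENTITY column numbers the rows in sorted order
    DECLARE @RowAsc table
    (
    ID      INT IDENTITY,
    Observation  FLOAT
    )
    INSERT INTO @RowAsc
    SELECT Observations FROM MyTable
    ORDER BY 1 
    -- (COUNT + 1) / 2 is the middle row for an odd count
    -- and the lower of the two middle rows for an even count
    SELECT @Obs=(COUNT(*)+1)/2 FROM @RowAsc
    SELECT Observation AS Median FROM @RowAsc WHERE ID=@Obs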
    
  • 2020-11-22 04:35

    For newbies like myself who are learning the very basics, I personally find this example easier to follow, as it is easier to understand exactly what's happening and where median values are coming from...

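    select
     -- Note: (max + min) / 2 is the midpoint of the range, which matches the
     -- median only when the values are spread symmetrically.
     -- Dividing by 2.0 avoids integer division.
     ( max(a.[Value1]) + min(a.[Value1]) ) / 2.0 as [Median Value1]
    ,( max(a.[Value2]) + min(a.[Value2]) ) / 2.0 as [Median Value2]
    
    from (select
        datediff(dd,startdate,enddate) as [Value1]
        ,xxxxxxxxxxxxxx as [Value2]
         from dbo.table1
         )a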
    

    In absolute awe of some of the code above though!!!
