Function to Calculate Median in SQL Server

孤独总比滥情好 2020-11-22 04:03

According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (u

30 Answers
  • 2020-11-22 04:26

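    Simple and fast. Note that for an even number of rows this returns the lower of the two middle values rather than their average.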

    SELECT x.Amount 
    FROM   (SELECT amount, 
                   Count(1) OVER (partition BY 'A')        AS TotalRows, 
                   Row_number() OVER (ORDER BY Amount ASC) AS AmountOrder 
            FROM   facttransaction ft) x 
    WHERE  x.AmountOrder = Round(x.TotalRows / 2.0, 0)  
    
  • 2020-11-22 04:27

    Using a single statement: one way is to use ROW_NUMBER() and filter with a subquery. This finds the median salary:

    SELECT AVG(a.Salary) FROM                                                             
    (SELECT ROW_NUMBER() OVER(ORDER BY Salary) as row_no, Salary FROM Employee)a
    CROSS JOIN
    (SELECT (COUNT(*)+1)*0.5 AS row_half FROM Employee )t
    WHERE a.row_no IN (FLOOR(t.row_half),CEILING(t.row_half))
    

    I have seen similar solutions over the net using FLOOR and CEILING but tried to use a single statement.
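
    For a quick check of the arithmetic, here is a minimal sketch that runs the same query against made-up data. The table variable @Employee and its four salaries are assumptions for illustration, not part of the original answer. With 4 rows, row_half is 2.5, so rows 2 and 3 (200 and 300) are averaged and the query returns 250.

    ```sql
    -- Hypothetical sample data; @Employee stands in for the Employee table above.
    DECLARE @Employee TABLE (Salary decimal(10, 2) NOT NULL);
    INSERT INTO @Employee (Salary) VALUES (100), (200), (300), (400);

    SELECT AVG(a.Salary) AS MedianSalary
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY Salary) AS row_no, Salary
          FROM @Employee) AS a
    CROSS JOIN
         (SELECT (COUNT(*) + 1) * 0.5 AS row_half FROM @Employee) AS t
    -- For an odd count FLOOR and CEILING coincide, so a single row is averaged.
    WHERE a.row_no IN (FLOOR(t.row_half), CEILING(t.row_half));
    ```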

  • 2020-11-22 04:28

    In SQL Server 2012 and later, you can use PERCENTILE_CONT:

    SELECT SalesOrderID, OrderQty,
        PERCENTILE_CONT(0.5) 
            WITHIN GROUP (ORDER BY OrderQty)
            OVER (PARTITION BY SalesOrderID) AS MedianCont
    FROM Sales.SalesOrderDetail
    WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
    ORDER BY SalesOrderID DESC
    

    See also: http://blog.sqlauthority.com/2011/11/20/sql-server-introduction-to-percentile_cont-analytic-functions-introduced-in-sql-server-2012/

  • 2020-11-22 04:29

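    SQL Server 2012 and later also has the PERCENTILE_DISC function, which returns a specific percentile of the sorted values. PERCENTILE_DISC(0.5) returns the median as a value that actually occurs in the data; with an even number of rows that is the lower of the two middle values, not their average - https://msdn.microsoft.com/en-us/library/hh231327.aspx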
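
    To compare PERCENTILE_DISC with PERCENTILE_CONT, the following sketch evaluates both over eight made-up numbers. The table variable @t and its values are assumptions for illustration. With an even number of rows, PERCENTILE_CONT interpolates between the two middle values (15 and 22) and returns 18.5, while PERCENTILE_DISC returns 15.

    ```sql
    -- Hypothetical sample data with an even number of rows.
    DECLARE @t TABLE (Number int NOT NULL);
    INSERT INTO @t (Number) VALUES (2), (4), (9), (15), (22), (26), (37), (49);

    -- Both are window functions, so every row carries the same result;
    -- DISTINCT collapses the output to a single row.
    SELECT DISTINCT
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Number) OVER () AS MedianCont,
           PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY Number) OVER () AS MedianDisc
    FROM @t;
    ```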

  • 2020-11-22 04:30

    I wanted to work out a solution by myself, but my brain tripped and fell on the way. I think it works, but don't ask me to explain it in the morning. :P

    DECLARE @table AS TABLE
    (
        Number int not null
    );
    
    insert into @table select 2;
    insert into @table select 4;
    insert into @table select 9;
    insert into @table select 15;
    insert into @table select 22;
    insert into @table select 26;
    insert into @table select 37;
    insert into @table select 49;
    
    DECLARE @Count AS INT
    SELECT @Count = COUNT(*) FROM @table;
    
    WITH MyResults(RowNo, Number) AS
    (
        SELECT RowNo, Number FROM
            (SELECT ROW_NUMBER() OVER (ORDER BY Number) AS RowNo, Number FROM @table) AS Foo
    )
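    -- Cast before averaging: AVG over an int column would truncate the result (18 instead of 18.5 here)
    SELECT AVG(CAST(Number AS float)) FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)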
    
  • 2020-11-22 04:30

    This is the most efficient solution for finding medians that I can think of. The names in the example are based on Justin's example. Make sure table Sales.SalesOrderHeader has an index with key columns CustomerId and TotalDue, in that order.

    SELECT
        sohCount.CustomerId,
        AVG(sohMid.TotalDue) AS TotalDueMedian
    FROM
        (SELECT
             soh.CustomerId,
             COUNT(*) AS NumberOfRows
         FROM Sales.SalesOrderHeader soh
         GROUP BY soh.CustomerId) AS sohCount
    CROSS APPLY
        (SELECT soh.TotalDue
         FROM Sales.SalesOrderHeader soh
         WHERE soh.CustomerId = sohCount.CustomerId
         ORDER BY soh.TotalDue
         OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
         FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY) AS sohMid
    GROUP BY sohCount.CustomerId
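
    The OFFSET and FETCH arithmetic can be tried in isolation. The sketch below applies the same two expressions to a small table variable; @t, @n, and the sample values are assumptions for illustration. With 8 rows, the offset is 8 / 2 - 1 = 3 and the fetch size is 2, so rows 4 and 5 (15 and 22) are averaged to 18.5. With an odd count, (@n + 1) % 2 is 0, so exactly one row, the middle one, is fetched.

    ```sql
    -- Hypothetical sample data with an even number of rows.
    DECLARE @t TABLE (Number int NOT NULL);
    INSERT INTO @t (Number) VALUES (2), (4), (9), (15), (22), (26), (37), (49);

    DECLARE @n int = (SELECT COUNT(*) FROM @t);

    -- 1.0 * Number avoids integer averaging of the int column.
    SELECT AVG(1.0 * mid.Number) AS Median
    FROM (SELECT Number
          FROM @t
          ORDER BY Number
          OFFSET @n / 2 - ((@n + 1) % 2) ROWS
          FETCH NEXT 1 + ((@n + 1) % 2) ROWS ONLY) AS mid;
    ```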
    

    UPDATE

    I was unsure which method performs best, so I compared my method, Justin Grant's, and Jeff Atwood's by running a query based on each of the three in one batch. The batch cost of each query was:

    Without index:

    • Mine 30%
    • Justin Grant's 13%
    • Jeff Atwood's 58%

    With index:

    • Mine 3%
    • Justin Grant's 10%
    • Jeff Atwood's 87%

    To see how well the queries scale when the index exists, I generated more data, starting from around 14,000 rows and doubling repeatedly up to a factor of 512, which gave around 7.2 million rows in the end. I made sure the CustomerId values were unique in each copy, so the ratio of rows to distinct CustomerId values stayed constant. While doing this I ran executions after rebuilding the index, and I noticed the results stabilized at around a factor of 128, with the data I had, at these values:

    • Mine 3%
    • Justin Grant's 5%
    • Jeff Atwood's 92%

    I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test that did just that. Now, instead of stabilizing, the batch cost ratio kept diverging. Also, instead of the roughly 20 rows per CustomerId on average, I ended up with around 10,000 rows per unique Id. The numbers were:

    • Mine 4%
    • Justin's 60%
    • Jeff's 35%

    I made sure I had implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as the index exists. I also noticed that this method is the one recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5
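
    The index this conclusion relies on could be created along these lines. This is only a sketch assuming the AdventureWorks sample schema; the index name is made up.

    ```sql
    -- Assumed index name; key columns in the order the answer asks for,
    -- so each customer's rows are already sorted by TotalDue.
    CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_CustomerId_TotalDue
        ON Sales.SalesOrderHeader (CustomerId, TotalDue);
    ```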

    A way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that keeps the count of SalesOrderHeader rows per CustomerId up to date. Of course, you could then simply store the median as well.
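
    One possible shape for such an auxiliary table and trigger is sketched below. The names dbo.CustomerOrderCount and trg_SalesOrderHeader_Count are hypothetical, and the trigger simply recounts the customers touched by each statement, which is one of several ways to keep the counts current. GO is the batch separator understood by SSMS and sqlcmd, needed because CREATE TRIGGER must start its own batch.

    ```sql
    -- Hypothetical auxiliary table holding one row count per customer.
    CREATE TABLE dbo.CustomerOrderCount
    (
        CustomerId   int NOT NULL PRIMARY KEY,
        NumberOfRows int NOT NULL
    );
    GO

    -- Hypothetical trigger that recounts only the customers a statement touched.
    CREATE TRIGGER Sales.trg_SalesOrderHeader_Count
    ON Sales.SalesOrderHeader
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;

        -- Remove the stale counts for every affected customer.
        DELETE c
        FROM dbo.CustomerOrderCount AS c
        WHERE c.CustomerId IN (SELECT CustomerId FROM inserted
                               UNION
                               SELECT CustomerId FROM deleted);

        -- Recount those customers from the base table.
        INSERT INTO dbo.CustomerOrderCount (CustomerId, NumberOfRows)
        SELECT soh.CustomerId, COUNT(*)
        FROM Sales.SalesOrderHeader AS soh
        WHERE soh.CustomerId IN (SELECT CustomerId FROM inserted
                                 UNION
                                 SELECT CustomerId FROM deleted)
        GROUP BY soh.CustomerId;
    END;
    ```

    The table would need to be filled once from the existing rows before the trigger can keep it current, and the median query could then read NumberOfRows from it instead of computing COUNT(*) on every call.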
