Function to Calculate Median in SQL Server

孤独总比滥情好 2020-11-22 04:03

According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (u

30 Answers
  •  时光说笑
    2020-11-22 04:30

    This is the most efficient solution for finding medians that I can think of. The names in the example are based on Justin's example. Make sure an index on table Sales.SalesOrderHeader exists with index columns CustomerId and TotalDue, in that order.
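
    One way to create the supporting index described above might look like this (the index name is illustrative, not from the original answer):

    ```sql
    -- Covering index so the CROSS APPLY can seek on CustomerId and read
    -- TotalDue in sorted order without an extra sort or lookup.
    CREATE INDEX IX_SalesOrderHeader_CustomerId_TotalDue
        ON Sales.SalesOrderHeader (CustomerId, TotalDue);
    ```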

    SELECT
        sohCount.CustomerId,
        AVG(sohMid.TotalDue) AS TotalDueMedian
    FROM
        (SELECT
            soh.CustomerId,
            COUNT(*) AS NumberOfRows
         FROM
            Sales.SalesOrderHeader soh
         GROUP BY soh.CustomerId) AS sohCount
    CROSS APPLY
        (SELECT
            soh.TotalDue
         FROM
            Sales.SalesOrderHeader soh
         WHERE soh.CustomerId = sohCount.CustomerId
         ORDER BY soh.TotalDue
         -- Skip to just before the middle: one extra row back when the count is even
         OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
         -- Fetch one row for an odd count, two rows (to be averaged) for an even count
         FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
        ) AS sohMid
    GROUP BY sohCount.CustomerId
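
    To see why the OFFSET/FETCH arithmetic picks out the middle row(s), here is the same pattern run against a small inline data set (the table and values here are illustrative, not from the original example):

    ```sql
    -- Five values: 1, 2, 3, 4, 100. The true median is 3.
    WITH Numbers(n) AS (
        SELECT v FROM (VALUES (1), (2), (3), (4), (100)) AS t(v)
    )
    SELECT AVG(1.0 * sohMid.n) AS Median
    FROM (SELECT COUNT(*) AS NumberOfRows FROM Numbers) AS cnt
    CROSS APPLY (
        SELECT n
        FROM Numbers
        ORDER BY n
        -- 5 rows: OFFSET 5/2 - (6 % 2) = 2, FETCH 1 + (6 % 2) = 1 -> the 3rd value
        OFFSET cnt.NumberOfRows / 2 - ((cnt.NumberOfRows + 1) % 2) ROWS
        FETCH NEXT 1 + ((cnt.NumberOfRows + 1) % 2) ROWS ONLY
    ) AS sohMid;
    -- Returns 3. With an even count, e.g. 6 rows, OFFSET 6/2 - (7 % 2) = 2 and
    -- FETCH 1 + 1 = 2, so the 3rd and 4th values are fetched and averaged.
    ```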
    

    UPDATE

    I was a bit unsure which method has the best performance, so I compared my method, Justin Grant's, and Jeff Atwood's by running queries based on all three methods in one batch; the batch cost of each query was:

    Without an index:

    • Mine 30%
    • Justin Grant's 13%
    • Jeff Atwood's 58%

    And with an index:

    • Mine 3%
    • Justin Grant's 10%
    • Jeff Atwood's 87%

    I tried to see how well the queries scale with an index by creating more data, starting from around 14,000 rows and multiplying by factors of 2 up to 512, which means around 7.2 million rows in the end. Note that I made sure the CustomerId field was unique each time I made a copy, so the proportion of rows to unique CustomerId values was kept constant. While doing this I ran executions where I rebuilt the index afterwards, and I noticed that with my data the results stabilized at around a factor of 128, at these values:

    • Mine 3%
    • Justin Grant's 5%
    • Jeff Atwood's 92%

    I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test that did just this. Now, instead of stabilizing, the batch cost ratio kept diverging; also, instead of about 20 rows per CustomerId on average, I ended up with around 10,000 rows per unique Id. The numbers were:

    • Mine 4%
    • Justin's 60%
    • Jeff's 35%

    I made sure I implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as an index exists. I also noticed that this method is what is recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5

    A way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that updates and holds the count of Sales.SalesOrderHeader rows per CustomerId; of course, you could then simply store the median as well.
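
    A minimal sketch of that auxiliary count table and trigger might look like the following. The table name, trigger name, and column names are all hypothetical, and a real implementation would need to consider concurrency and whether CustomerId can change on update:

    ```sql
    -- Hypothetical auxiliary table holding a per-customer row count.
    CREATE TABLE Sales.CustomerOrderCounts (
        CustomerId   INT NOT NULL PRIMARY KEY,
        NumberOfRows INT NOT NULL
    );
    GO
    -- Keep the counts in sync whenever Sales.SalesOrderHeader changes.
    CREATE TRIGGER trg_SalesOrderHeader_Counts
    ON Sales.SalesOrderHeader
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Recompute counts only for the customers touched by this statement.
        MERGE Sales.CustomerOrderCounts AS target
        USING (
            SELECT soh.CustomerId, COUNT(*) AS NumberOfRows
            FROM Sales.SalesOrderHeader soh
            WHERE soh.CustomerId IN (SELECT CustomerId FROM inserted
                                     UNION
                                     SELECT CustomerId FROM deleted)
            GROUP BY soh.CustomerId
        ) AS src
        ON target.CustomerId = src.CustomerId
        WHEN MATCHED THEN
            UPDATE SET NumberOfRows = src.NumberOfRows
        WHEN NOT MATCHED THEN
            INSERT (CustomerId, NumberOfRows)
            VALUES (src.CustomerId, src.NumberOfRows);
    END;
    ```

    The median query could then join against Sales.CustomerOrderCounts instead of recomputing COUNT(*) per customer on every call.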
