Function to Calculate Median in SQL Server

孤独总比滥情好 2020-11-22 04:03

According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (u

30 Answers
  •  时光说笑
    2020-11-22 04:30

    This is the most efficient solution for finding medians that I can think of. The names in the example are based on Justin's example. Make sure an index on table Sales.SalesOrderHeader exists with index columns CustomerId and TotalDue, in that order.
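
    One way to create the supporting index described above might look like this (the index name is illustrative, not from the original answer):

    ```sql
    -- Covering index so the CROSS APPLY can seek on CustomerId and read
    -- TotalDue in sorted order without an extra sort or lookup.
    CREATE INDEX IX_SalesOrderHeader_CustomerId_TotalDue
        ON Sales.SalesOrderHeader (CustomerId, TotalDue);
    ```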

    SELECT
        sohCount.CustomerId,
        AVG(sohMid.TotalDue) AS TotalDueMedian
    FROM
        (SELECT
            soh.CustomerId,
            COUNT(*) AS NumberOfRows
         FROM
            Sales.SalesOrderHeader soh
         GROUP BY soh.CustomerId) AS sohCount
    CROSS APPLY
        (SELECT
            soh.TotalDue
         FROM
            Sales.SalesOrderHeader soh
         WHERE soh.CustomerId = sohCount.CustomerId
         ORDER BY soh.TotalDue
         -- Skip to just before the middle: one extra row back when the count is even
         OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
         -- Fetch one row for an odd count, two rows (to be averaged) for an even count
         FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
        ) AS sohMid
    GROUP BY sohCount.CustomerId
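
    To see why the OFFSET/FETCH arithmetic picks out the middle row(s), here is the same pattern run against a small inline data set (the table and values here are illustrative, not from the original example):

    ```sql
    -- Five values: 1, 2, 3, 4, 100. The true median is 3.
    WITH Numbers(n) AS (
        SELECT v FROM (VALUES (1), (2), (3), (4), (100)) AS t(v)
    )
    SELECT AVG(1.0 * sohMid.n) AS Median
    FROM (SELECT COUNT(*) AS NumberOfRows FROM Numbers) AS cnt
    CROSS APPLY (
        SELECT n
        FROM Numbers
        ORDER BY n
        -- 5 rows: OFFSET 5/2 - (6 % 2) = 2, FETCH 1 + (6 % 2) = 1 -> the 3rd value
        OFFSET cnt.NumberOfRows / 2 - ((cnt.NumberOfRows + 1) % 2) ROWS
        FETCH NEXT 1 + ((cnt.NumberOfRows + 1) % 2) ROWS ONLY
    ) AS sohMid;
    -- Returns 3. With an even count, e.g. 6 rows, OFFSET 6/2 - (7 % 2) = 2 and
    -- FETCH 1 + 1 = 2, so the 3rd and 4th values are fetched and averaged.
    ```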
    

    UPDATE

    I was a bit unsure which method has the best performance, so I compared my method, Justin Grant's, and Jeff Atwood's by running queries based on all three methods in one batch; the batch cost of each query was:

    Without an index:

    • Mine 30%
    • Justin Grant's 13%
    • Jeff Atwood's 58%

    And with an index:

    • Mine 3%
    • Justin Grant's 10%
    • Jeff Atwood's 87%

    I tried to see how well the queries scale with an index by creating more data, starting from around 14,000 rows and multiplying by factors of 2 up to 512, which means around 7.2 million rows in the end. Note that I made sure the CustomerId field was unique each time I made a copy, so the proportion of rows to unique CustomerId values was kept constant. While doing this I ran executions where I rebuilt the index afterwards, and I noticed that with my data the results stabilized at around a factor of 128, at these values:

    • Mine 3%
    • Justin Grant's 5%
    • Jeff Atwood's 92%

    I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test that did just this. Now, instead of stabilizing, the batch cost ratio kept diverging; also, instead of about 20 rows per CustomerId on average, I ended up with around 10,000 rows per unique Id. The numbers were:

    • Mine 4%
    • Justin's 60%
    • Jeff's 35%

    I made sure I implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as an index exists. I also noticed that this method is what is recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5

    A way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that updates and holds the count of Sales.SalesOrderHeader rows per CustomerId; of course, you could then simply store the median as well.
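
    A minimal sketch of that auxiliary count table and trigger might look like the following. The table name, trigger name, and column names are all hypothetical, and a real implementation would need to consider concurrency and whether CustomerId can change on update:

    ```sql
    -- Hypothetical auxiliary table holding a per-customer row count.
    CREATE TABLE Sales.CustomerOrderCounts (
        CustomerId   INT NOT NULL PRIMARY KEY,
        NumberOfRows INT NOT NULL
    );
    GO
    -- Keep the counts in sync whenever Sales.SalesOrderHeader changes.
    CREATE TRIGGER trg_SalesOrderHeader_Counts
    ON Sales.SalesOrderHeader
    AFTER INSERT, UPDATE, DELETE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Recompute counts only for the customers touched by this statement.
        MERGE Sales.CustomerOrderCounts AS target
        USING (
            SELECT soh.CustomerId, COUNT(*) AS NumberOfRows
            FROM Sales.SalesOrderHeader soh
            WHERE soh.CustomerId IN (SELECT CustomerId FROM inserted
                                     UNION
                                     SELECT CustomerId FROM deleted)
            GROUP BY soh.CustomerId
        ) AS src
        ON target.CustomerId = src.CustomerId
        WHEN MATCHED THEN
            UPDATE SET NumberOfRows = src.NumberOfRows
        WHEN NOT MATCHED THEN
            INSERT (CustomerId, NumberOfRows)
            VALUES (src.CustomerId, src.NumberOfRows);
    END;
    ```

    The median query could then join against Sales.CustomerOrderCounts instead of recomputing COUNT(*) per customer on every call.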
