According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using the CREATE AGGREGATE function, a user-defined function, or some other method).
Simple, fast, accurate
SELECT x.Amount
FROM   (SELECT Amount,
               COUNT(1) OVER (PARTITION BY 'A') AS TotalRows,
               ROW_NUMBER() OVER (ORDER BY Amount ASC) AS AmountOrder
        FROM   facttransaction ft) x
WHERE  x.AmountOrder = ROUND(x.TotalRows / 2.0, 0)
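A quick check of the arithmetic in that filter (my note, not part of the original answer): with 7 rows, 7 / 2.0 = 3.5 and ROUND(3.5, 0) = 4, so the 4th row (the true middle) is returned; with 8 rows, 8 / 2.0 = 4.0, so the 4th row is returned, i.e. the lower of the two middle values rather than their average.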
Using a single statement - One way is to use ROW_NUMBER() and filter with a sub-query. Here is how to find the median salary:
SELECT AVG(a.Salary) FROM
(SELECT ROW_NUMBER() OVER (ORDER BY Salary) AS row_no, Salary FROM Employee) a
CROSS JOIN
(SELECT (COUNT(*) + 1) * 0.5 AS row_half FROM Employee) t
WHERE a.row_no IN (FLOOR(t.row_half), CEILING(t.row_half))
I have seen similar solutions online using FLOOR and CEILING, but I tried to do it in a single statement.
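To see why averaging the FLOOR and CEILING rows works (a quick worked example, not from the original answer): with 5 employees, row_half = (5 + 1) * 0.5 = 3, so FLOOR and CEILING both point at row 3 and the AVG is just that single middle value; with 6 employees, row_half = 3.5, so rows 3 and 4 are selected and their average is the median.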
In SQL Server 2012 you should use PERCENTILE_CONT:
SELECT SalesOrderID, OrderQty,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY OrderQty)
OVER (PARTITION BY SalesOrderID) AS MedianCont
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
See also: http://blog.sqlauthority.com/2011/11/20/sql-server-introduction-to-percentile_cont-analytic-functions-introduced-in-sql-server-2012/
MS SQL Server 2012 (and later) has the PERCENTILE_DISC function, which computes a specific percentile for sorted values. PERCENTILE_DISC(0.5) will compute the median - https://msdn.microsoft.com/en-us/library/hh231327.aspx
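A minimal sketch of what that looks like, mirroring the PERCENTILE_CONT example above (the table and filter values are borrowed from that example, not from this answer):
SELECT SalesOrderID, OrderQty,
    PERCENTILE_DISC(0.5)
        WITHIN GROUP (ORDER BY OrderQty)
        OVER (PARTITION BY SalesOrderID) AS MedianDisc
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
Unlike PERCENTILE_CONT, which interpolates between the two middle values, PERCENTILE_DISC always returns an actual value from the column.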
I wanted to work out a solution by myself, but my brain tripped and fell on the way. I think it works, but don't ask me to explain it in the morning. :P
DECLARE @table AS TABLE
(
Number int not null
);
insert into @table select 2;
insert into @table select 4;
insert into @table select 9;
insert into @table select 15;
insert into @table select 22;
insert into @table select 26;
insert into @table select 37;
insert into @table select 49;
DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;
WITH MyResults(RowNo, Number) AS
(
SELECT RowNo, Number FROM
(SELECT ROW_NUMBER() OVER (ORDER BY Number) AS RowNo, Number FROM @table) AS Foo
)
SELECT AVG(1.0 * Number) -- multiplying by 1.0 avoids integer truncation when the two middle rows are averaged
FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)
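For what it's worth, the arithmetic in that WHERE clause (my reading, since the author left it unexplained) works out like this: with the 8 rows above, (@Count+1)/2 = 4 under integer division and ((@Count+1)%2) * ((@Count+2)/2) = 1 * 5 = 5, so rows 4 and 5 (values 15 and 22) are averaged to 18.5; with an odd count such as 7, the second expression becomes 0 * 4 = 0, which matches no row, so only the single middle row contributes.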
This is the most efficient solution for finding medians that I can think of. The names in the example are based on Justin's example. Make sure an index on the table Sales.SalesOrderHeader exists with index columns CustomerId and TotalDue, in that order.
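Such an index might be created like this (the index name is my own, not from the original answer):
CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_CustomerId_TotalDue
    ON Sales.SalesOrderHeader (CustomerId, TotalDue);
With that index in place, the query is: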
SELECT
sohCount.CustomerId,
AVG(sohMid.TotalDue) as TotalDueMedian
FROM
(SELECT
soh.CustomerId,
COUNT(*) as NumberOfRows
FROM
Sales.SalesOrderHeader soh
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY
(Select
soh.TotalDue
FROM
Sales.SalesOrderHeader soh
WHERE soh.CustomerId = sohCount.CustomerId
ORDER BY soh.TotalDue
OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
) As sohMid
GROUP BY sohCount.CustomerId
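The OFFSET/FETCH arithmetic, worked through for two small cases (my note, not part of the original answer): for a customer with 5 rows, OFFSET = 5/2 - (6 % 2) = 2 - 0 = 2 and FETCH = 1 + (6 % 2) = 1, so only the 3rd row is returned; for 6 rows, OFFSET = 6/2 - (7 % 2) = 3 - 1 = 2 and FETCH = 1 + (7 % 2) = 2, so rows 3 and 4 are returned and then averaged by the outer AVG.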
UPDATE
I was a bit unsure about which method had the best performance, so I compared my method, Justin Grant's and Jeff Atwood's by running queries based on all three methods in one batch, and the batch cost of each query was:
Without index:
And with index:
I tried to see how well the queries scale when you have an index by creating more data, going from around 14,000 rows by a factor of 2 up to 512, which means in the end around 7.2 million rows. Note that I made sure the CustomerId values were unique each time I made a copy, so the ratio of rows to unique CustomerId values was kept constant. While I was doing this I ran executions where I rebuilt the index afterwards, and I noticed the results stabilized at around a factor of 128 with the data I had, at these values:
I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test where I did just this. Now, instead of stabilizing, the batch cost ratio kept diverging; also, instead of about 20 rows per CustomerId on average, I ended up with around 10,000 rows per unique Id. The numbers were:
I made sure I implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as an index exists. I also noticed that this method is what's recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5
A way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that updates and holds the count of SalesOrderHeader rows per CustomerId, and of course you could then simply store the median as well.
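A minimal sketch of what that could look like (the table, column and trigger names are mine, not from the original answer; it only keeps the per-customer row counts up to date on INSERT and DELETE):
CREATE TABLE dbo.CustomerOrderCounts
(
    CustomerId   int NOT NULL PRIMARY KEY,
    NumberOfRows int NOT NULL
);

-- Seed the counts from the existing data.
INSERT INTO dbo.CustomerOrderCounts (CustomerId, NumberOfRows)
SELECT CustomerId, COUNT(*)
FROM Sales.SalesOrderHeader
GROUP BY CustomerId;
GO

CREATE TRIGGER dbo.trg_SalesOrderHeader_Counts
ON Sales.SalesOrderHeader
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- Net change per customer: inserted rows count as +1, deleted rows as -1.
    WITH delta AS
    (
        SELECT CustomerId, SUM(n) AS n
        FROM (SELECT CustomerId, 1 AS n FROM inserted
              UNION ALL
              SELECT CustomerId, -1 AS n FROM deleted) AS x
        GROUP BY CustomerId
    )
    MERGE dbo.CustomerOrderCounts AS target
    USING delta ON target.CustomerId = delta.CustomerId
    WHEN MATCHED THEN
        UPDATE SET NumberOfRows = target.NumberOfRows + delta.n
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, NumberOfRows) VALUES (delta.CustomerId, delta.n);
END;
The median query above could then join to this table instead of recomputing the counts with GROUP BY on every call.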