According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (u
2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. Also, SQL Server releases since then (especially SQL 2012) have introduced new T-SQL features that can be used to calculate medians. SQL Server releases have also improved its query optimizer which may affect perf of various median solutions. Net-net, my original 2009 post is still OK but there may be better solutions on for modern SQL Server apps. Take a look at this article from 2012 which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median
This article found the following pattern to be much, much faster than all other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest (PERCENTILE_CONT
) solution tested. Note that this trick requires two separate queries which may not be practical in all cases. It also requires SQL 2012 or later.
DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);
SELECT AVG(1.0 * val)
FROM (
SELECT val FROM dbo.EvenRows
ORDER BY val
OFFSET (@c - 1) / 2 ROWS
FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;
Of course, just because one test on one schema in 2012 yielded great results, your mileage may vary, especially if you're on SQL Server 2014 or later. If perf is important for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure that you've found the best one for your schema.
I'd also be especially careful using the (new in SQL Server 2012) function PERCENTILE_CONT that's recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has been improved in the 7 years since, but personally I wouldn't use this function on a large table until I verified its performance vs. other solutions.
ORIGINAL 2009 POST IS BELOW:
There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from Medians, ROW_NUMBERs, and performance. This is a particularly optimal solution when it comes to actual I/Os generated during execution – it looks more costly than other solutions, but it is actually much faster.
That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.
As with all database performance scenarios, always try to test a solution out with real data on real hardware – you never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally-speedy solution slower.
SELECT
CustomerId,
AVG(TotalDue)
FROM
(
SELECT
CustomerId,
TotalDue,
-- SalesOrderId in the ORDER BY is a disambiguator to break ties
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
FROM Sales.SalesOrderHeader SOH
) x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;
Although Justin grant's solution appears solid I found that when you have a number of duplicate values within a given partition key the row numbers for the ASC duplicate values end up out of sequence so they do not properly align.
Here is a fragment from my result:
KEY VALUE ROWA ROWD
13 2 22 182
13 1 6 183
13 1 7 184
13 1 8 185
13 1 9 186
13 1 10 187
13 1 11 188
13 1 12 189
13 0 1 190
13 0 2 191
13 0 3 192
13 0 4 193
13 0 5 194
I used Justin's code as the basis for this solution. Although not as efficient given the use of multiple derived tables it does resolve the row ordering problem I encountered. Any improvements would be welcome as I am not that experienced in T-SQL.
SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE
FROM
(
SELECT PKEY,VALUE,ROWA,ROWD,
'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END)
FROM
(
SELECT
PKEY,
cast(VALUE as decimal(5,2)) as VALUE,
ROWA,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD
FROM
(
SELECT
PKEY,
VALUE,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA
FROM [MTEST]
)T1
)T2
)T3
WHERE FLAG = '1'
GROUP BY PKEY
ORDER BY PKEY
This is as simple an answer as I could come up with. Worked well with my data. If you want to exclude certain values just add a where clause to the inner select.
SELECT TOP 1
ValueField AS MedianValue
FROM
(SELECT TOP(SELECT COUNT(1)/2 FROM tTABLE)
ValueField
FROM
tTABLE
ORDER BY
ValueField) A
ORDER BY
ValueField DESC
My original quick answer was:
select max(my_column) as [my_column], quartile
from (select my_column, ntile(4) over (order by my_column) as [quartile]
from my_table) i
--where quartile = 2
group by quartile
This will give you the median and interquartile range in one fell swoop. If you really only want one row that is the median then uncomment the where clause.
When you stick that into an explain plan, 60% of the work is sorting the data which is unavoidable when calculating position dependent statistics like this.
I've amended the answer to follow the excellent suggestion from Robert Ševčík-Robajz in the comments below:
;with PartitionedData as
(select my_column, ntile(10) over (order by my_column) as [percentile]
from my_table),
MinimaAndMaxima as
(select min(my_column) as [low], max(my_column) as [high], percentile
from PartitionedData
group by percentile)
select
case
when b.percentile = 10 then cast(b.high as decimal(18,2))
else cast((a.low + b.high) as decimal(18,2)) / 2
end as [value], --b.high, a.low,
b.percentile
from MinimaAndMaxima a
join MinimaAndMaxima b on (a.percentile -1 = b.percentile) or (a.percentile = 10 and b.percentile = 10)
--where b.percentile = 5
This should calculate the correct median and percentile values when you have an even number of data items. Again, uncomment the final where clause if you only want the median and not the entire percentile distribution.
DECLARE @Obs int
DECLARE @RowAsc table
(
ID INT IDENTITY,
Observation FLOAT
)
INSERT INTO @RowAsc
SELECT Observations FROM MyTable
ORDER BY 1
SELECT @Obs=COUNT(*)/2 FROM @RowAsc
SELECT Observation AS Median FROM @RowAsc WHERE ID=@Obs
For newbies like myself who are learning the very basics, I personally find this example easier to follow, as it is easier to understand exactly what's happening and where median values are coming from...
select
( max(a.[Value1]) + min(a.[Value1]) ) / 2 as [Median Value1]
,( max(a.[Value2]) + min(a.[Value2]) ) / 2 as [Median Value2]
from (select
datediff(dd,startdate,enddate) as [Value1]
,xxxxxxxxxxxxxx as [Value2]
from dbo.table1
)a
In absolute awe of some of the codes above though!!!