How do I exclude outliers from an aggregate query?

情书的邮戳 2021-02-10 09:30

I'm creating a report comparing total time and volume across units. Here is a simplification of the query I'm using at the moment:

SELECT  m.Unit,
        COUNT(         


        
4 Answers
  •  谎友^ (original poster)
    2021-02-10 09:47

    NTILE is quite inexact. If you run NTILE against the sample view below, you will see that it catches an indeterminate number of rows rather than the central 90%. The suggestion to take TOP 95 PERCENT and then reverse the order and take TOP 90 PERCENT is almost correct, except that 90% x 95% leaves only 85.5% of the original dataset. To keep 90% of the original rows, the outer query has to keep 90/95, or about 94.7368%, of the inner result, so you would have to do:

    select top 94.7368 percent *
    from (
    select top 95 percent *
        from 
        order by .. ASC
    ) X
    order by .. DESC
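
    To make the arithmetic concrete: the inner query keeps 95% of the rows, and the outer query must keep 90/95 of those so that 90% of the original rows remain. A minimal sketch, assuming the rows live in a table or view called main_table with a TimeInMinutes column (both names are placeholders borrowed from the sample view in this answer, not from your real schema):

    ```sql
    -- Assumed names: main_table and TimeInMinutes are placeholders.
    -- Inner query: sort ascending and keep the lowest 95%, dropping the largest 5%.
    -- Outer query: sort descending and keep 90/95 = 94.7368% of what is left,
    -- which drops the smallest rows and leaves roughly 90% of the original set.
    SELECT TOP (94.7368) PERCENT *
    FROM (
        SELECT TOP (95) PERCENT *
        FROM main_table
        ORDER BY TimeInMinutes ASC
    ) X
    ORDER BY TimeInMinutes DESC;
    ```

    SQL Server rounds a fractional row count from TOP ... PERCENT up to the next whole row, so on small tables the result may be a row or two off an exact 90%.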
    

    First, create a view whose column names match your table (here it is built on SQL Server's master..spt_values as sample data):

    create view main_table
    as
    select type unit, number as timeinminutes from master..spt_values
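
    Before trimming anything, it can help to look at what the sample view contains. A small sketch, assuming the main_table view above has been created; the column aliases are only illustrative:

    ```sql
    -- Summarize the sample data per unit: row count and range of values.
    SELECT Unit,
           COUNT(*)           AS row_count,
           MIN(TimeInMinutes) AS min_minutes,
           MAX(TimeInMinutes) AS max_minutes
    FROM main_table
    GROUP BY Unit
    ORDER BY Unit;
    ```

    Units that already have few rows here are the ones most likely to disappear once a minimum-count filter is applied to the trimmed data.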
    

    Try this instead:

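    -- Number every row by TimeInMinutes, then keep only the middle positions
    select Unit, COUNT(*), SUM(TimeInMinutes)
    FROM
    (
        select *,
            ROW_NUMBER() over (order by TimeInMinutes) rn,  -- position in sorted order
            COUNT(*) over () countRows                      -- total number of rows
        from main_table
    ) N -- Numbered
    -- keep positions between 5% and 95% of the total row count
    where rn between countRows * 0.05 and countRows * 0.95
    -- countRows is the same on every row, so grouping by Unit alone is enough
    group by Unit
    having count(*) > 20  -- only units with more than 20 rows left after trimming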
    

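    The HAVING clause is applied to the set that remains after the outliers are removed, so only units with more than 20 remaining rows are reported. For a dataset such as 1,1,1,1,1,1,2,5,6,19, ROW_NUMBER gives each of the tied 1's its own position, so a cutoff can remove just one instance of the 1's instead of all six.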
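
    The effect of ROW_NUMBER on tied values can be seen on that ten-value dataset directly. This is only a sketch: the inline VALUES list, the column name v, and the status label are made up for illustration and stand in for main_table and TimeInMinutes.

    ```sql
    -- Hypothetical inline data standing in for main_table.
    SELECT v,
           rn,
           CASE WHEN rn BETWEEN countRows * 0.05 AND countRows * 0.95
                THEN 'kept' ELSE 'trimmed' END AS status
    FROM (
        SELECT v,
               ROW_NUMBER() OVER (ORDER BY v) AS rn,   -- ties still get distinct positions
               COUNT(*) OVER () AS countRows           -- 10 for this data
        FROM (VALUES (1),(1),(1),(1),(1),(1),(2),(5),(6),(19)) AS T(v)
    ) N
    ORDER BY rn;
    ```

    The six 1's receive positions 1 through 6, so the BETWEEN bounds act on positions rather than on values. With ten rows the bounds work out to 0.5 and 9.5, so positions 1 to 9 are kept and only the 19 is trimmed; on a larger set where the lower bound falls among the tied 1's, only the rows below that position are dropped.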
