How do I exclude outliers from an aggregate query?

情书的邮戳 2021-02-10 09:30

I'm creating a report comparing total time and volume across units. Here is a simplified version of the query I'm using at the moment:

SELECT  m.Unit,
        COUNT(         


        
4 Answers
  • 2021-02-10 09:44

    I think the most robust way is to sort the list into order and then exclude the top and bottom extremes. For a hundred values, you would sort ascending and take the first 95 PERCENT, then sort descending and take the first 90 PERCENT.

  • 2021-02-10 09:47

    NTILE is quite inexact. If you run NTILE against the sample view below, you will see that it catches an indeterminate number of rows rather than the central 90%. The suggestion to use TOP 95 PERCENT and then a reversed TOP 90 PERCENT is almost correct, except that 90% x 95% leaves only 85.5% of the original dataset. To end up with 90% of the original rows, the second step has to keep 90/95 = 94.7368% of the first result. So you would have to do

    select top 94.7368 percent *
    from (
    select top 95 percent *
        from 
        order by .. ASC
    ) X
    order by .. DESC
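
The arithmetic behind the 94.7368 figure can be checked with a short Python sketch. It assumes 100 distinct sample values and mimics TOP n PERCENT by rounding a fractional row count up, which is how SQL Server treats it; the helper name `top_percent` is made up for this illustration and is not part of any library.

```python
import math

def top_percent(rows, percent):
    """Mimic T-SQL TOP (percent) PERCENT on an already ordered list.

    A fractional row count is rounded up to the next whole row.
    """
    # round() removes floating-point noise before taking the ceiling
    keep = math.ceil(round(len(rows) * percent / 100, 6))
    return rows[:keep]

values = list(range(1, 101))  # assumed sample: the integers 1..100

# Step 1: ascending order, keep the first 95% (drops the 5 largest).
step1 = top_percent(sorted(values), 95)

# Step 2, as first suggested: descending order, keep 90% of what is left.
naive = top_percent(sorted(step1, reverse=True), 90)

# Step 2, corrected: keep 90/95 = 94.7368% of what is left.
fixed = top_percent(sorted(step1, reverse=True), 94.7368)

print(len(step1), len(naive), len(fixed))  # 95 86 90
print(min(fixed), max(fixed))              # 6 95
```

With the corrected percentage, the five smallest and five largest values are dropped and 90 of the original 100 rows remain.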
    

    First create a view to match your table column names

    create view main_table
    as
    select type unit, number as timeinminutes from master..spt_values
    

    Try this instead

    select Unit, COUNT(*), SUM(TimeInMinutes)
    FROM
    (
        select *,
            ROW_NUMBER() over (order by TimeInMinutes) rn, -- position of each row when sorted by time
            COUNT(*) over () countRows                     -- total row count, repeated on every row
        from main_table
    ) N -- Numbered
    -- keep the central 90% of rows: drop the lowest 5% and the highest 5%
    where rn > countRows * 0.05 and rn <= countRows * 0.95
    group by Unit
    having count(*) > 20
    

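    The HAVING clause is applied to the set that remains after the outliers are removed. Because ROW_NUMBER gives tied values distinct positions, a cutoff that falls inside a run of ties, as in the dataset 1,1,1,1,1,1,2,5,6,19, removes only as many of the 1's as needed (for example just one instance) instead of all of them.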
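
As a rough illustration of that point about ties, the Python sketch below trims the same ten values in two ways: by row position (the ROW_NUMBER idea) and by value. It assumes one row is dropped from each end, which is a choice made for this example rather than the 5% used in the query, so that the cutoff lands inside the run of 1's.

```python
data = [1, 1, 1, 1, 1, 1, 2, 5, 6, 19]  # dataset quoted in the answer
drop = 1  # assumed for illustration: rows to trim from each end

n = len(data)
ordered = sorted(data)

# Trim by row position: rn runs from 1 to n, as ROW_NUMBER() would assign it.
by_position = [v for rn, v in enumerate(ordered, start=1)
               if drop < rn <= n - drop]

# Trim by value: discard every row equal to the smallest or largest value.
low, high = ordered[0], ordered[-1]
by_value = [v for v in ordered if low < v < high]

print(by_position)  # [1, 1, 1, 1, 1, 2, 5, 6]
print(by_value)     # [2, 5, 6]
```

Position-based trimming removes a single 1 and the 19, whereas value-based trimming throws away all six 1's.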

  • 2021-02-10 09:54

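    You can exclude the top and bottom x percent of rows with NTILE. NTILE(20) splits the ordered rows into 20 buckets of roughly 5% each, so keeping buckets 2 through 19 drops approximately the lowest and highest 5%: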

    SELECT m.Unit,
            COUNT(*) AS Count,
            SUM(m.TimeInMinutes) AS TotalTime
    FROM    
            (SELECT
                 m.Unit,
                 m.TimeInMinutes, -- must be selected here so the outer SUM can use it
                 NTILE(20) OVER (ORDER BY m.TimeInMinutes) AS Buckets
             FROM
                 main_table m
             WHERE
                 m.unit <> '' AND m.TimeInMinutes > 0
            ) m
    WHERE   
          Buckets BETWEEN 2 AND 19
    GROUP BY m.Unit
    HAVING  COUNT(*) > 15
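
How many rows the `Buckets BETWEEN 2 AND 19` filter actually removes depends on the row count. The Python sketch below reproduces the NTILE distribution rule (bucket sizes differ by at most one, larger buckets first) and counts the rows trimmed on each side; the function names `ntile` and `trimmed` are hypothetical helpers written for this example.

```python
def ntile(n_rows, buckets):
    """Bucket number for each of n_rows ordered rows, following the NTILE rule:
    sizes differ by at most one and the larger buckets come first."""
    base, extra = divmod(n_rows, buckets)
    result = []
    for b in range(1, buckets + 1):
        size = base + 1 if b <= extra else base
        result.extend([b] * size)
    return result

def trimmed(n_rows, buckets=20, lo=2, hi=19):
    """Rows dropped below, rows kept, and rows dropped above by BETWEEN lo AND hi."""
    tiles = ntile(n_rows, buckets)
    below = sum(t < lo for t in tiles)
    above = sum(t > hi for t in tiles)
    return below, n_rows - below - above, above

for n in (100, 110, 10):
    print(n, trimmed(n))
# 100 (5, 90, 5)
# 110 (6, 99, 5)
# 10 (1, 9, 0)
```

The trim is an exact 5% per side only when the row count is a multiple of 20; with fewer rows than buckets, nothing is removed from the top at all.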
    

    Edit: this article has several techniques too

  • 2021-02-10 10:03

    One way would be to exclude the outliers with a NOT IN clause:

    where  m.ID not in 
           (
           select  top 5 percent ID
           from    main_table 
           order by 
                   TimeInMinutes desc
           )
    

    Then add a second NOT IN clause, with the subquery ordered by TimeInMinutes ascending, for the bottom five percent.
