Find entries of 20 or more by contact within one minute of each entry

终归单人心 2021-01-29 12:57

We are collecting analytics data for contacts and each page they visit. A lot of that data comes from malicious attacks or bots, so they are hitting something like 20+ pages

2 answers
  • 2021-01-29 13:23

    For data like:

    n, d
    John, 2020-01-01 00:00:10
    John, 2020-01-01 00:00:30
    John, 2020-01-01 00:00:50
    John, 2020-01-01 00:01:10
    John, 2020-01-01 00:01:30
    John, 2020-01-01 00:01:50
    

    You could group on raw minute precision of the date; it might be sufficient:

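    SELECT n, CAST(d AS date) AS dt, DATEDIFF(minute, CAST(d AS date), d) AS minute_of_day
    FROM t
    -- group on the date as well, so the same minute on different days is not merged
    GROUP BY n, CAST(d AS date), DATEDIFF(minute, CAST(d AS date), d)
    HAVING COUNT(*) >= 20  -- "20 or more", per the question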
    

    Of course, you might get someone who blats 20 requests across the minute boundary, so that half fall in each minute. You could counter that by adding 30 seconds to all the times, running the same query on the shifted values, and unioning the two queries.
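
    A minimal sketch of that union, assuming the same table t(n, d) and a threshold of 20 or more rows. The derived-table alias shifted and the column s are names made up for illustration. It narrows the gap rather than closing it: a burst spread over almost a full minute can still be split across the buckets of both queries.

    ```sql
    -- Buckets aligned to hh:mm:00
    SELECT n
    FROM t
    GROUP BY n, CAST(d AS date), DATEDIFF(minute, CAST(d AS date), d)
    HAVING COUNT(*) >= 20

    UNION  -- UNION (not UNION ALL) so each n is listed once

    -- Buckets aligned to hh:mm:30, obtained by shifting every time by 30 seconds
    SELECT n
    FROM (SELECT n, DATEADD(second, 30, d) AS s FROM t) AS shifted
    GROUP BY n, CAST(s AS date), DATEDIFF(minute, CAST(s AS date), s)
    HAVING COUNT(*) >= 20;
    ```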

    There are other things you could do, such as a correlated subquery that looks back over the past minute from each row and counts how many rows for the same n fall within that one-minute sliding window:

    SELECT 
      n, 
      (SELECT COUNT(*) FROM t tI WHERE tI.n = tO.n AND tI.d BETWEEN DATEADD(minute, -1, tO.d) AND tO.d) ct
    FROM 
      t tO
    

    This resultset could then be queried with a GROUP BY n HAVING MAX(ct) >= 20.
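
    Put together, that might look like the sketch below. It assumes the same table t(n, d); the derived-table alias counted is a name made up for illustration.

    ```sql
    SELECT n
    FROM (
        SELECT
          tO.n,
          -- rows for the same n in the minute ending at this row (both ends inclusive)
          (SELECT COUNT(*)
           FROM t tI
           WHERE tI.n = tO.n
             AND tI.d BETWEEN DATEADD(minute, -1, tO.d) AND tO.d) AS ct
        FROM t tO
    ) AS counted
    GROUP BY n
    HAVING MAX(ct) >= 20;  -- at least one window holds 20 or more rows
    ```

    Because BETWEEN is inclusive, a row exactly 60 seconds before the current one is counted too.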

    Footnote: it's a shame SQL Server doesn't support ranging on dates in its window functions the way Oracle does, e.g. COUNT(*) OVER(PARTITION BY n ORDER BY d RANGE BETWEEN INTERVAL '1' MINUTE PRECEDING AND CURRENT ROW). SQL Server understands RANGE, but only with UNBOUNDED PRECEDING/FOLLOWING and CURRENT ROW, where CURRENT ROW covers rows that have the same ORDER BY value as the current row, and I don't believe there's a way to adjust a continuously variable datetime so that this applies.

  • 2021-01-29 13:35

    You can get contacts that visited 20 pages within a minute using lag():

    select distinct contactid
    from (select t.*,
                 lag(datecreated, 19) over (partition by contactid order by datecreated) as lag20
          from t
         ) t
    where lag20 > dateadd(minute, -1, datecreated);
    

    That is, there are 20 rows within a minute if you look back 19 rows and that row is less than a minute before the current row.
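
    To see the idea on a small scale, here is a sketch with made-up sample rows and a threshold of 3 instead of 20. The @threshold variable, the inline sample data, and the lag_n alias are assumptions for illustration only. With this data, only John is returned: his third row (00:00:50) is less than a minute after the row two positions back (00:00:10), while Jane's rows are two minutes apart.

    ```sql
    DECLARE @threshold int = 3;  -- hypothetical; the question uses 20

    WITH t (contactid, datecreated) AS (
        SELECT v.contactid, CAST(v.datecreated AS datetime2(0))
        FROM (VALUES
            ('John', '2020-01-01 00:00:10'),
            ('John', '2020-01-01 00:00:30'),
            ('John', '2020-01-01 00:00:50'),
            ('Jane', '2020-01-01 00:00:00'),
            ('Jane', '2020-01-01 00:02:00'),
            ('Jane', '2020-01-01 00:04:00')
        ) AS v (contactid, datecreated)
    )
    SELECT DISTINCT contactid
    FROM (
        SELECT contactid,
               datecreated,
               -- look back (threshold - 1) rows within the same contact
               LAG(datecreated, @threshold - 1)
                   OVER (PARTITION BY contactid ORDER BY datecreated) AS lag_n
        FROM t
    ) AS x
    -- NULL lag_n (fewer than threshold rows so far) never passes this filter
    WHERE lag_n > DATEADD(minute, -1, datecreated);
    ```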
