SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value

后端 未结 3 1623
一整个雨季
一整个雨季 2021-01-19 01:52

In our company, our clients perform various activities that we log in different tables - Interview attendance, Course Attendance, and other general activities. I have a data

相关标签:
3条回答
  • 2021-01-19 02:00

    See SQL Fiddle here.

    with tbl as (
      -- this will generate daily merged ovelaping time
      select distinct
        a.id
        ,(
            select min(x.starttime) 
            from act x 
            where x.id=a.id and ( x.starttime between a.starttime and a.endtime
              or a.starttime between x.starttime and x.endtime )
        ) start1
        ,(
            select max(x.endtime) 
            from act x 
            where x.id=a.id and ( x.endtime between a.starttime and a.endtime
              or a.endtime between x.starttime and x.endtime )
        ) end1
      from act a
    
    ), tbl2 as 
    (
      -- this will add minute and total minute column
      select 
        * 
        ,datediff(mi,t.start1,t.end1) mi
        ,(select sum(datediff(mi,x.start1,x.end1)) from tbl x where x.id=t.id and x.end1<=t.end1) totalmi
      from tbl t
    ), tbl3 as 
    (
      -- now final query showing starttime and endtime for 5 hours other wise null in case not completed 5(300 minutes) hours
      select 
        t.id
        ,min(t.start1) starttime
        ,min(case when t.totalmi>300 then t.end1 else null end) endtime
      from tbl2 t
      group by t.id
    )
    -- final result 
    select *
    from tbl3
    where endtime is not null
    
    0 讨论(0)
  • 2021-01-19 02:04

    This is one way to do it:

    ;WITH CTErn AS (
       SELECT activity_client_id, activity_type,
              activity_start_date, activity_end_date,
              ROW_NUMBER() OVER (PARTITION BY activity_client_id 
                                 ORDER BY activity_start_date) AS rn
       FROM activities
    ),   
    CTEdiff AS (
       SELECT c1.activity_client_id, c1.activity_type,
              x.activity_start_date, c1.activity_end_date,
              DATEDIFF(mi, x.activity_start_date, c1.activity_end_date) AS diff,
              ROW_NUMBER() OVER (PARTITION BY c1.activity_client_id 
                                 ORDER BY x.activity_start_date) AS seq
       FROM CTErn AS c1
       LEFT JOIN CTErn AS c2 ON c1.rn = c2.rn + 1
       CROSS APPLY (SELECT CASE 
                              WHEN c1.activity_start_date < c2.activity_end_date
                                 THEN c2.activity_end_date
                              ELSE c1.activity_start_date
                           END) x(activity_start_date)    
    )
    SELECT TOP 1 client_id, client_sign_up_date, activity_start_date, 
                 hoursOfActivicty               
    FROM CTEdiff AS c1
    INNER JOIN clients AS c2 ON c1.activity_client_id = c2.client_id                     
    CROSS APPLY (SELECT SUM(diff) / 60.0
                 FROM CTEdiff AS c3
                 WHERE c3.seq <= c1.seq) x(hoursOfActivicty)
    WHERE hoursOfActivicty >= 5
    ORDER BY seq
    

    Common Table Expressions and ROW_NUMBER() were introduced with SQL Server 2005, so the above query should work for that version.

    Demo here

    The first CTE, i.e. CTErn, produces the following output:

    client_id   activity_type   start_date          end_date          rn
    112         Interview       2015-06-01 09:00    2015-06-01 11:00  1
    112         CV updating     2015-06-01 09:30    2015-06-01 11:30  2
    112         Course          2015-06-02 09:00    2015-06-02 16:00  3
    112         Interview       2015-06-03 09:00    2015-06-03 10:00  4
    

    The second CTE, i.e. CTEdiff, uses the above table expression in order to calculate time difference for each record, taking into consideration any overlapps with the previous record:

    client_id activity_type start_date       end_date         diff  seq
    112       Interview     2015-06-01 09:00 2015-06-01 11:00 120   1
    112       CV updating   2015-06-01 11:00 2015-06-01 11:30 30    2
    112       Course        2015-06-02 09:00 2015-06-02 16:00 420   3
    112       Interview     2015-06-03 09:00 2015-06-03 10:00 60    4
    

    The final query calculates the cumulative sum of time difference and selects the first record that exceeds 5 hours of activity.

    The above query will work for simple interval overlaps, i.e. when just the end date of an activity overlaps the start date of the next activity.

    0 讨论(0)
  • 2021-01-19 02:14

    A Geometric Approach

    For another issue, I've taken a geometric approach to date packing. Namely, I convert dates and times to a sql geometry type and utilize geometry::UnionAggregate to merge the ranges.

    I don't believe this will work in sql-server 2005. But your problem was such an interesting puzzle that I wanted to see whether the geometrical approach would work. So any future users running into this problem that have access to a later version can consider it.

    Code Description

    In 'numbers':

    • I build a table representing a sequence
    • Swap it out with your favorite way to make a numbers table.
    • For a union operation, you won't ever need more rows than in your original table, so I just use it as the base to build it.

    In 'mergeLines':

    • I convert the dates to floats and use those floats to create geometrical points.
    • I then connect these points via STUnion and STEnvelope.
    • Finally, I merge all these lines via UnionAggregate. The resulting 'lines' geometry object might contain multiple lines, but if they overlap, they turn into one line.

    In 'redate':

    • I use the numbers CTE to extract the individual lines inside 'lines'.
    • I envelope the lines which here ensures that the lines are stored only as its two endpoints.
    • I read the endpoint x values and convert them back to their time representations (This is usually the end goal, but you need more).
    • I calculate the difference in minutes between activity start and end dates (I do this first in seconds then divide by 60 for the sake of a precision issue).
    • I calculate the cumulative sume of these minutes for each row.

    In the outer query:

    • I align the previous cumulative minutes sum with each current row
    • I filter for the row where the 5hr goal was met but where the previous minutes shows that the 5hr goal for the previous row was not met.
    • I then calculate where in the current row's range the user has met the 5 hours, to not only arrive at the date the five hour goal was met, but the exact time.

    The Code

    with
    
        numbers as (
    
            select  row_number() over (order by (select null)) i 
            from    @activities -- where I put your data
    
        ),
    
        mergeLines as (
    
            select      activity_client_id,
                        lines = geometry::UnionAggregate(line)
            from        @activities
            cross apply (select 
                            startP = geometry::Point(convert(float,activity_start_date), 0, 0),
                            stopP = geometry::Point(convert(float,activity_end_date), 0, 0)
                        ) pointify
            cross apply (select line = startP.STUnion(stopP).STEnvelope()) lineify
            group by    activity_client_id
    
        ),
    
        redate as (
    
            select      client_id = activity_client_id, 
                        activities_start_date,
                        activities_end_date,
                        minutes,
    
                        rollingMinutes = sum(minutes) over(
                            partition by activity_client_id 
                            order by activities_start_date 
                            rows between unbounded preceding and current row
                        )
    
            from        mergeLines ml
            join        numbers n on n.i between 1 and ml.lines.STNumGeometries()
            cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
            cross apply (select 
                            activities_start_date = convert(datetime, l.line.STPointN(1).STX),
                            activities_end_date = convert(datetime, l.line.STPointN(3).STX)
                        ) unprepare
            cross apply (select minutes = 
                            round(datediff(s, activities_start_date, activities_end_date) / 60.0,0)
                        ) duration
    
        )
    
        select      client_id,
                    activities_start_date,
                    activities_end_date,
                    met_5hr_goal = dateadd(minute, (60 * 5) - prevRoll, activities_start_date) 
        from        (
                        select  *,
                                prevRoll = lag(rollingMinutes) over (
                                    partition by client_id 
                                    order by rollingMinutes
                                )
                        from    redate 
                    ) ranker
        where       rollingMinutes >= 60 * 5
        and         prevRoll < 60 * 5;
    
    0 讨论(0)
提交回复
热议问题