Join overlapping date ranges

半阙折子戏 2021-02-08 06:16

I need to join table A and table B to create table C.

Table A and Table B store status flags for the IDs. The status flags (A_Flag and B_Flag) can change from time to ti

4 Answers
  •  孤城傲影
    2021-02-08 06:27

    Sequential processing with shifts and offsets like this is one of the situations where the SAS DATA step shines. This answer is not exactly simple, but it is simpler than the SQL equivalent: SQL can do it, but it was not designed with this kind of sequential processing in mind.

    Furthermore, DATA step solutions tend to be very efficient. This one runs in O(n log n) time in theory, but closer to O(n) in practice, and in constant space.

    The first two DATA steps just load the data. They are slightly modified from Joe's answer to include multiple IDs (with a single ID the syntax is MUCH easier) and to add a corner case: an ID for which it is impossible to determine the initial state.

    data tableA;
    informat start end DDMMYY10.;
    format start end DATE9.;
    input ID  Start           End     A_Flag;
    datalines;
    1   01/01/2008  23/03/2008  1
    2   23/03/2008  15/06/2008  0
    2   15/06/2008  18/08/2008  1
    ;;;;
    run;
    
    data tableB;
    informat start end DDMMYY10.;
    format start end DATE9.;
    input ID  Start           End     B_Flag;
    datalines;
    1   19/01/2008  17/02/2008  1
    2   17/02/2008  15/06/2008  0
    4   15/06/2008  18/08/2008  1
    ;;;;
    run;
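
    The later steps use BY id and BY id start, which require each input table to already be in that order. The sample data above is entered in sorted order; for real data that may not be, a minimal guard (an addition, not part of the original steps) is to sort both tables first:

    ```sas
    /* Assumption: real inputs may arrive unsorted.
       BY-group processing below needs id/start order. */
    proc sort data=tableA;
       by id start;
    run;

    proc sort data=tableB;
       by id start;
    run;
    ```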
    

    The next data step finds the first modification for each id and flag and sets the initial value to the opposite of what it found.

    /* Get initial state by inverting first change */
    data firstA;
        set tableA;
        by id;
        if first.id;
        A_Flag = ~A_Flag;
    run;
    
    data firstB;
        set tableB;
        by id;
        if first.id;
        B_Flag = ~B_Flag;
    run;
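    data first;
        merge firstA firstB;
        by id;
        /* MERGE keeps firstB's start/end when both tables have the id.
           Blank the dates so this artificial row always sorts first
           within its id in the BY id start step that follows. */
        call missing(start, end);
    run;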
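
    An ID that appears in only one of the two tables ends up with a missing initial flag in "first"; that is the corner case where the initial state cannot be determined. A small sketch for listing such IDs (the dataset name "undetermined" is made up for illustration):

    ```sas
    /* Hypothetical helper: list ids whose initial A or B state is unknown.
       A flag is missing in "first" when the id has no rows in that table. */
    data undetermined;
       set first;
       where missing(A_Flag) or missing(B_Flag);
       keep id A_Flag B_Flag;
    run;
    ```

    With the sample data above, this should pick up ID 4, which has rows only in tableB.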
    

    The next data step merges the artificial "first" table with the other two, retaining the last state known and discarding the artificial initial row.

    data tableAB (drop=lastA lastB);
       set first tableA tableB;   /* interleave rows in id/start order */
       by id start;
       retain lastA lastB;        /* last known value of each flag */
       if A_flag = . and ~first.id then A_flag = lastA;
       else lastA = A_flag;
       if B_flag = . and ~first.id then B_flag = lastB;
       else lastB = B_flag;
       if ~first.id;  /* drop artificial first row per id */
    run;
    

    The steps above do almost everything. The one remaining problem is that the end dates are wrong, because they are copied from the original rows. To fix that, copy the next row's start into each row's end, unless it is the final row for its id. The easiest way is to sort each id by descending start, look back one record with lag(), then sort ascending again at the end.

    /* sort descending to ... */
    proc sort data=tableAB;
       by id descending start;
    run;
    /* ... copy next start to this row's "end" field if not final */
    data tableAB(drop=nextStart);
       set tableAB;
       by id descending start;
       nextStart=lag(start);
       if ~first.id then end=nextStart;
    run;
    
    proc sort data=tableAB;
       by id start;
    run;
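
    If the two extra sorts are a concern, a look-ahead merge is a commonly used alternative to the descending sort plus lag(). The sketch below is not part of the original answer. It assumes tableAB is in ascending id/start order, and the names tableAB_next, nextId and nextStart are made up for illustration:

    ```sas
    /* Hypothetical alternative to the sort / lag / sort sequence.
       Assumes tableAB is in ascending id, start order. */
    data tableAB_next(drop=nextId nextStart);
       /* One-to-one merge (no BY): the second copy starts at row 2,
          so each row sees the id and start of the row that follows it. */
       merge tableAB
             tableAB(firstobs=2 keep=id start
                     rename=(id=nextId start=nextStart));
       /* Overwrite end only when the following row has the same id;
          the final row of each id keeps its original end. */
       if id = nextId then end = nextStart;
    run;
    ```

    This reads the table once and leaves it in its existing order, at the cost of being a little less obvious than lag().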
    
