I need to join table A and table B to create table C.
Table A and Table B store status flags for the IDs. The status flags (A_Flag and B_Flag) can change from time to ti
I'm going to solve this in SQL, assuming that your database has the lag window function (SQL Server 2012, Oracle, Postgres, DB2). You can get the same effect with a correlated subquery.
The idea is to get all the different time periods. Then join back to the original tables to get the flags.
I am having trouble uploading the code, but I can post most of it. It starts with startends, which you create by doing a union (not union all) of the four dates in one column: select a.start as thedate. This is then union'ed with a.end, b.start, and b.end.
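with startends as (
    -- union (not union all) removes duplicate dates
    select id, a.start as thedate from a
    union select id, a.end from a
    union select id, b.start from b
    union select id, b.end from b
),
driver as (
    -- each date closes the period opened by the previous date for the same id
    select id, lag(thedate) over (partition by id order by thedate) as startdate, thedate as enddate
    from startends
)
select driver.id, driver.startdate, driver.enddate, a.flag as a_flag, b.flag as b_flag
from driver left outer join
     a
     on a.id = driver.id and a.start <= driver.startdate and a.end >= driver.enddate left outer join
     b
     on b.id = driver.id and b.start <= driver.startdate and b.end >= driver.enddate
where driver.startdate is not null  -- the first date per id has no previous date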
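For readers who want to check the period-building step outside a database, here is a minimal Python sketch of the same idea. It is an illustration only, not part of the SQL above: the function name `elementary_intervals` is made up, and the dates are the single-ID example rows that appear in the SAS answer further down.

```python
from datetime import date

def elementary_intervals(dates):
    """Pair each distinct date with the next one in sorted order."""
    ordered = sorted(set(dates))            # set() plays the role of UNION (not UNION ALL)
    return list(zip(ordered, ordered[1:]))  # consecutive pairs, like lag/lead over the dates

# All start and end dates from both tables for one ID (assumed sample data).
all_dates = [
    date(2008, 1, 1), date(2008, 3, 23), date(2008, 6, 15), date(2008, 8, 18),   # table A
    date(2008, 1, 19), date(2008, 2, 17), date(2008, 6, 15), date(2008, 8, 18),  # table B
]

intervals = elementary_intervals(all_dates)
for start, end in intervals:
    print(start, end)
```

Each resulting pair is a period during which neither flag can change, so looking up the covering row in each table gives the flags for that period.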
This type of sequential processing with shifts and offsets is one of the situations where the SAS DATA step shines. This answer is not simple, but it is simpler than the SQL equivalent: SQL can do it, but it isn't designed with this kind of sequential processing in mind.
Furthermore, solutions based on DATA step tend to be very efficient. This one runs in time O(n log n) in theory, but closer to O(n) in practice, and in constant space.
The first two DATA steps are just loading data, slightly modified from Joe's answer, to have multiple IDs (otherwise the syntax is MUCH easier) and to add some corner cases, i.e., an ID for which it is impossible to determine initial state.
data tableA;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End A_Flag;
datalines;
1 01/01/2008 23/03/2008 1
2 23/03/2008 15/06/2008 0
2 15/06/2008 18/08/2008 1
;;;;
run;
data tableB;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End B_Flag;
datalines;
1 19/01/2008 17/02/2008 1
2 17/02/2008 15/06/2008 0
4 15/06/2008 18/08/2008 1
;;;;
run;
The next data steps find the first modification for each id and flag, set the initial value to the opposite of what they found, and merge the two results into one artificial "first" row per id.
/* Get initial state by inverting first change */
data firstA;
set tableA;
by id;
if first.id;
A_Flag = ~A_Flag;
run;
data firstB;
set tableB;
by id;
if first.id;
B_Flag = ~B_Flag;
run;
data first;
merge firstA firstB;
by id;
run;
The next data step merges the artificial "first" table with the other two, retaining the last state known and discarding the artificial initial row.
data tableAB (drop=lastA lastB);
set first tableA tableB;
by id start;
retain lastA lastB;
if A_flag = . and ~first.id then A_flag = lastA;
else lastA = A_flag;
if B_flag = . and ~first.id then B_flag = lastB;
else lastB = B_flag;
if ~first.id; /* drop artificial first row per id */
run;
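The retain logic in the step above is a "last known value" carry-forward. As a rough illustration only, here is the same idea in Python; the function name `carry_forward` and the tuple layout are assumptions made for this sketch, and it does not drop the artificial first row the way the DATA step does.

```python
def carry_forward(rows):
    """Fill missing flags with the last value seen for the same id.

    rows: (id, start, a_flag, b_flag) tuples sorted by id, then start.
    None stands for a SAS missing value (the row does not set that flag).
    """
    filled = []
    current_id = object()        # sentinel that never equals a real id
    last_a = last_b = None
    for row_id, start, a_flag, b_flag in rows:
        if row_id != current_id:  # new id: forget the previous id's state
            current_id = row_id
            last_a = last_b = None
        if a_flag is None:
            a_flag = last_a
        else:
            last_a = a_flag
        if b_flag is None:
            b_flag = last_b
        else:
            last_b = b_flag
        filled.append((row_id, start, a_flag, b_flag))
    return filled

# Assumed sample: an artificial initial row for id 1, then interleaved A and B rows.
rows = [
    (1, "2008-01-01", 0, 0),     # artificial initial state
    (1, "2008-01-01", 1, None),  # row from table A
    (1, "2008-01-19", None, 1),  # row from table B
    (1, "2008-02-17", None, 0),  # row from table B
    (2, "2008-03-23", None, 1),  # new id: nothing to carry over
]
filled = carry_forward(rows)
```

Resetting the remembered values at each new id mirrors the `~first.id` checks in the DATA step, so state never leaks from one id into the next.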
The steps above do almost everything. The only bug is that the end dates will be wrong, because they are copied from the original row. To fix that, copy the next start to each row's end, unless it is a final row. The easiest way is to sort each id by reverse start, look back one record, then sort ascending again at the end.
/* sort descending to ... */
proc sort data=tableAB;
by id descending start;
run;
/* ... copy next start to this row's "end" field if not final */
data tableAB(drop=nextStart);
set tableAB;
by id descending start;
nextStart=lag(start);
if ~first.id then end=nextStart;
run;
proc sort data=tableAB;
by id start;
run;
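The sort, lag, sort sequence above amounts to a "look ahead one row" within each id. A small Python sketch of that end-date fix follows, under the assumption that rows are dictionaries already sorted by id and start; the name `close_intervals` is hypothetical.

```python
from itertools import groupby

def close_intervals(rows):
    """Set each row's end to the next row's start within the same id.

    rows: dicts with "id", "start" and "end", sorted by id, then start.
    The final row of each id keeps its original end.
    """
    fixed = []
    for _, group in groupby(rows, key=lambda r: r["id"]):
        group = list(group)
        for current, following in zip(group, group[1:]):
            fixed.append({**current, "end": following["start"]})
        fixed.append(dict(group[-1]))  # last row per id: nothing to look ahead to
    return fixed

# Assumed sample with end dates copied from the original rows (the "wrong" ends).
rows = [
    {"id": 1, "start": "2008-01-01", "end": "2008-03-23"},
    {"id": 1, "start": "2008-01-19", "end": "2008-02-17"},
    {"id": 1, "start": "2008-02-17", "end": "2008-06-15"},
    {"id": 2, "start": "2008-03-23", "end": "2008-06-15"},
]
fixed = close_intervals(rows)
```

Because Python can look at the following element directly, no reverse sort is needed here; the DATA step sorts descending only because lag can look backward but not forward.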
One possible SAS solution to this is to perform a partial join, and then create the necessary additional rows in the data step. This should work assuming tableA has all possible records; if that's not the case (if tableB can start before tableA), some additional logic may be needed to consider that possibility (if first.id and start gt b_start). There may also be additional logic needed for issues not present in the example data - I don't have a lot of time this morning and didn't debug this for anything beyond the example data cases, but the concept should be evident.
data tableA;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End A_Flag;
datalines;
1 01/01/2008 23/03/2008 1
1 23/03/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
;;;;
run;
data tableB;
informat start end DDMMYY10.;
format start end DATE9.;
input ID Start End B_Flag;
datalines;
1 19/01/2008 17/02/2008 1
1 17/02/2008 15/06/2008 0
1 15/06/2008 18/08/2008 1
;;;;
run;
proc sql;
create table c_temp as
select * from tableA A
left join (select id, start as b_start, end as b_end, b_flag from tableB) B
on A.Id = B.id
where (A.start le B.b_start and A.end gt B.b_start) or (A.start lt B.b_end and A.end ge B.b_end)
order by A.ID, A.start, B.b_start;
quit;
data tableC;
set c_temp;
by id start;
retain b_flag_ret;
format start_fin end_fin DATE9.;
if first.id then b_flag_ret=0;
do until (start=end);
if (start lt b_start) and first.start then do;
start_fin=start;
end_fin=b_start;
a_flag_fin=a_flag;
b_flag_fin=b_flag_ret;
output;
start=b_start;
end;
else do; *start=b_start;
start_fin=ifn(start ge b_start, start, b_start);
end_fin = ifn(b_end le end, b_end, end);
a_flag_fin=a_flag;
b_flag_fin=b_flag;
output;
start=end; *leave the loop as there will be a later row that matches;
end;
end;
run;
The problem you posed can be solved in one SQL statement without nonstandard extensions.
The most important thing to recognize is that the dates in the begin-end pairs each represent a potential starting or ending point of a time span during which a given pair of flag values holds. It actually doesn't matter that one date is a "begin" and another an "end"; any date is a time delimiter that does both: it ends a prior period and begins another. Construct a set of minimal time intervals, and join them to the tables to find the flags that obtained during each interval.
I added your example (and a solution) to my Canonical SQL page. See there for a detailed discussion. In fairness to SO, here's the query itself:
with D (ID, bound) as (
select ID
, case T when 's' then StartDate else EndDate end as bound
from (
select ID, StartDate, EndDate from so.A
UNION
select ID, StartDate, EndDate from so.B
) as U
cross join (select 's' as T union select 'e') as T
)
select P.*, a.Flag as A_Flag, b.Flag as B_Flag
from (
select s.ID, s.bound as StartDate, min(e.bound) as EndDate
from D as s join D as e
on s.ID = e.ID
and s.bound < e.bound
group by s.ID, s.bound
) as P
left join so.A as a
on P.ID = a.ID
and a.StartDate <= P.StartDate and P.EndDate <= a.EndDate
left join so.B as b
on P.ID = b.ID
and b.StartDate <= P.StartDate and P.EndDate <= b.EndDate
order by P.ID, P.StartDate, P.EndDate
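To sanity-check what the query should return, here is a hedged Python sketch of the same logic: collect the bounds per ID, pair each bound with the next larger one, and look up the row in each table that contains the interval. The function names and the sample rows (the single-ID data from the SAS answer) are assumptions for illustration; this is not the author's code.

```python
from datetime import date

def merge_flags(table_a, table_b):
    """Each table is a list of (id, start, end, flag) tuples.

    Returns (id, start, end, a_flag, b_flag) for every minimal interval.
    """
    bounds = {}
    for row_id, start, end, _ in table_a + table_b:
        bounds.setdefault(row_id, set()).update((start, end))

    def flag_for(table, row_id, start, end):
        # Same containment test as the left joins in the query.
        for tid, s, e, flag in table:
            if tid == row_id and s <= start and end <= e:
                return flag
        return None  # no covering row: corresponds to NULL from the left join

    result = []
    for row_id in sorted(bounds):
        ordered = sorted(bounds[row_id])
        # zip over the sorted bounds stands in for the self-join with min(e.bound)
        for start, end in zip(ordered, ordered[1:]):
            result.append((row_id, start, end,
                           flag_for(table_a, row_id, start, end),
                           flag_for(table_b, row_id, start, end)))
    return result

table_a = [
    (1, date(2008, 1, 1), date(2008, 3, 23), 1),
    (1, date(2008, 3, 23), date(2008, 6, 15), 0),
    (1, date(2008, 6, 15), date(2008, 8, 18), 1),
]
table_b = [
    (1, date(2008, 1, 19), date(2008, 2, 17), 1),
    (1, date(2008, 2, 17), date(2008, 6, 15), 0),
    (1, date(2008, 6, 15), date(2008, 8, 18), 1),
]
table_c = merge_flags(table_a, table_b)
```

Sorting the bounds once per ID avoids the quadratic self-join, which is a convenience of the procedural setting rather than a change to the logic.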