How to replicate a SAS merge

后端 未结 2 389
一向
一向 2020-12-22 14:30

I have two tables, t1 and t2:

t1
  person | visit | code1 | type1
       1       1      50      50 
       1       1      50      50 
       1       2      7         


        
相关标签:
2条回答
  • 2020-12-22 15:14

    You can replicate a SAS merge by adding a row_number() to each table:

    select t1.*, t2.*
    from (select t1.*,
                 row_number() over (partition by person, visit order by ??) as seqnum
          from t1
         ) t1 full outer join
         (select t2.*,
                 row_number() over (partition by person, visit order by ??) as seqnum
          from t2
         ) t2
         on t1.person = t2.person and t1.visit = t2.visit and
            t1.seqnum = t2.seqnum;
    

    Notes:

    • The ?? means to put in the column(s) used for ordering. SAS datasets have an intrinsic order. SQL tables do not, so the ordering needs to be specified.
    • You should list the columns explicitly (instead of using t1.*, t2.* in the outer query). I think SAS only includes person and visit once in the resulting dataset.

    EDIT:

    Note: the above produces separate values for the key columns. This is easy enough to fix:

    select coalesce(t1.person, t2.person) as person,
           coalesce(t1.key, t2.key) as key,
           t1.code1, t1.type1, t2.code2, t2.type2
    from (select t1.*,
                 row_number() over (partition by person, visit order by ??) as seqnum
          from t1
         ) t1 full outer join
         (select t2.*,
                 row_number() over (partition by person, visit order by ??) as seqnum
          from t2
         ) t2
         on t1.person = t2.person and t1.visit = t2.visit and
            t1.seqnum = t2.seqnum;
    

    That fixes the columns issue. You can fix the copying issue by using first_value()/last_value() or by using a more complicated join condition:

    select coalesce(t1.person, t2.person) as person,
           coalesce(t1.visit, t2.visit) as visit,
           t1.code1, t1.type1, t2.code2, t2.type2
    from (select t1.*,
                 count(*) over (partition by person, visit) as cnt,
                 row_number() over (partition by person, visit order by ??) as seqnum
          from t1
         ) t1 full outer join
         (select t2.*,
                 count(*) over (partition by person, visit) as cnt,
                 row_number() over (partition by person, visit order by ??) as seqnum
          from t2
         ) t2
         on t1.person = t2.person and t1.visit = t2.visit and
            (t1.seqnum = t2.seqnum or
            (t1.cnt > t2.cnt and t1.seqnum > t2.seqnum and t2.seqnum = t2.cnt) or
            (t2.cnt > t1.cnt and t2.seqnum > t1.seqnum and t1.seqnum = t1.cnt)
    

    This implements the "keep the last row" logic in a single join. Probably for performance reasons, you would want to put this into separate left joins on the original logic.

    0 讨论(0)
  • 2020-12-22 15:30

    Gordon's answer is close; but it misses one point. Here's its output:

    person  visit   code1   type1   seqnum  person  visit   code2   type2   seqnum
    1       1       1       1       1       1       1       1       1       1
    1       1       2       2       2       1       1       2       2       2
    NULL    NULL    NULL    NULL    NULL    1       1       3       3       3
    1       2       1       3       1       NULL    NULL    NULL    NULL    NULL
    

    The third row's nulls are incorrect, while the fourth's are correct.

    As far as I know, in SQL there's not a really good way to do this other than splitting things up into a few queries. I think there are five possibilities:

    • Matching person/visit, Matching seqnums
    • Matching person/visit, Left has more seqnums
    • Matching person/visit, Right has more seqnums
    • Left has unmatched person/visit
    • Right has unmatched person/visit

    I think the last two might be workable into one query, but I think the second and third have to be separate queries. You can union everything together, of course.

    So here's an example, using some temporary tables that are a little more well suited to see what's going on. Note that the third row is now filled in for code1 and type1, even though those are 'extra'. I've only added three of the five criteria - the three you had in your initial example - but the other two aren't too hard.

    Note that this is an example of something far faster in SAS - because SAS has a row-wise concept, ie, it's capable of going one row at a time. SQL tends to take a lot longer at these, with large tables, unless it's possible to partition things very neatly and have very good indexes - and even then I've never seen a SQL DBA do anywhere near as well as SAS at some of these types of problems. That's something you'll have to accept of course - SQL has its own advantages, one of which being probably price...

    Here's my example code. I'm sure it's not terribly elegant, hopefully one of the SQL folk can improve it. This is written to work in SQL Server (using table variables), same thing should work with some changes (to use temporary tables) in other variants, assuming they implement windowing. (SAS of course can't do this particular thing - as even FedSQL implements ANSI 1999, not ANSI 2008.) This is based on Gordon's initial query, then modified with the additional bits at the end. Anyone who wants to improve this please feel free to edit and/or copy to a new/existing answer any bit you wish.

    declare @t1 table (person INT, visit INT, code1 INT, type1 INT);
    declare @t2 table (person INT, visit INT, code2 INT, type2 INT);
    
    
    insert into @t1 values (1,1,1,1)
    insert into @t1 values (1,1,2,2)
    insert into @t1 values (1,2,1,3)
    
    insert into @t2 values (1,1,1,1)
    insert into @t2 values (1,1,2,2)
    insert into @t2 values (1,1,3,3)
    
    select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
                    t1.code1, t1.type1, t2.code2, t2.type2
    from (select *,
                 row_number() over (partition by person, visit order by type1) as seqnum
          from @t1
         ) t1 inner join
         (select *,
                 row_number() over (partition by person, visit order by type2) as seqnum
          from @t2
         ) t2
         on t1.person = t2.person and t1.visit = t2.visit and
            t1.seqnum = t2.seqnum
     union all
    
    select coalesce(t1.person, t2.person) as person, coalesce(t1.visit, t2.visit) as visit,
                    t1.code1, t1.type1, t2.code2, t2.type2
    from (
          (select person, visit, MAX(seqnum) as max_rownum from (
            select person, visit, 
                 row_number() over (partition by person, visit order by type1) as seqnum
          from @t1) t1_f 
          group by person, visit
         ) t1_m inner join
         (select *, row_number() over (partition by person, visit order by type1) as seqnum
           from @t1
          ) t1 
            on t1.person=t1_m.person and t1.visit=t1_m.visit
            and t1.seqnum=t1_m.max_rownum
            inner join
         (select *,
                 row_number() over (partition by person, visit order by type2) as seqnum
          from @t2
         ) t2
         on t1.person = t2.person and t1.visit = t2.visit and
            t1.seqnum < t2.seqnum 
         )
     union all
     select t1.person, t1.visit, t1.code1, t1.type1, t2.code2, t2.type2
         from @t1 t1 left join @t2 t2
        on t2.person=t1.person and t2.visit=t1.visit
        where t2.code2 is null
    
    0 讨论(0)
提交回复
热议问题