SQL Query to Collapse Duplicate Values By Date Range

前端 未结 4 2080
后悔当初
后悔当初 2020-12-12 23:19

I have a table with the following structure: ID, Month, Year, Value with values for one entry per id per month, most months have the same value.

I would like to crea

相关标签:
4条回答
  • 2020-12-12 23:32

    I got it to work as follows. It is heavy on analytic functions and is Oracle specific.

    select distinct id, value,
    decode(startMonth, null,
      lag(startMonth) over(partition by id, value order by startMonth, endMonth),  --if start is null, it's an end so take from the row before
    startMonth) startMonth,
    
      decode(endMonth, null,
      lead(endMonth) over(partition by id, value order by startMonth, endMonth),  --if end is null, it's an start so take from the row after
    endMonth) endMonth    
    
    from (
    select id, value, startMonth, endMonth from(
    select id, value, 
    decode(month+1, lead(month) over(partition by id,value order by month), null, month)     
    startMonth, --get the beginning month for each interval
    decode(month-1, lag(month) over(partition by id,value order by month), null, month)     
    endMonth --get the end month for each interval from Tbl
    ) a 
    where startMonth is not null or endMonth is not null --remain with start and ends only
    )b
    

    It might be possible to simplify some of the inner queries somewhat

    The inner query checks if the month is a first/last month of the interval as follows: if the month + 1 == the next month (lag) for that grouping, then since there is a next month, this month is obviously not the end month. Otherwise, it is the last month of the interval. The same concept is used to check for the first month.

    The outer query first filters out all rows that are not either start or end months (where startMonth is not null or endMonth is not null). Then, each row is either a start month or an end month (or both), determined by whether start or end is not null). If the month is a start month, get the corresponding end month by getting the next (lead) endMonth for that id,value ordered by endMonth, and if it is an endMonth get the startMonth by looking for the previous startMonth (lag)

    0 讨论(0)
  • 2020-12-12 23:33

    I couldn't get the response from ngz to work when the input table contains multiple ids and date ranges that span years. I have a solution that does work, but with qualifications. It will only give you the correct answers if you know that you have a row for every month/year/id combination within the range. If there are "holes" it won't work. If you have holes, I know of know good way to do it other than writing some PL/SQL and using a cursor loop to create a new table in the format you want.

    By the way, this is why data modeled this way is an abomination. You should always store stuff as start/from range records, not as discrete time period records. It's trivial to transform the former into the latter with a "multiplier" table, but it's almost impossible (as you've seen) to go the other direction.

    SELECT ID
         , VALUE
         , start_date
         , end_date
      FROM (SELECT ID
                 , VALUE
                 , start_date
                 , CASE
                      WHEN is_last = 0
                         THEN LEAD(end_date) OVER(PARTITION BY ID ORDER BY start_date)
                      ELSE end_date
                   END end_date
                 , is_first
              FROM (SELECT ID
                         , VALUE
                         , TO_CHAR(the_date, 'YYYY.MM') start_date
                         , TO_CHAR(NVL(LEAD(the_date - 31) OVER(PARTITION BY ID ORDER BY YEAR
                                      , MONTH), the_date), 'YYYY.MM') end_date
                         , is_first
                         , is_last
                      FROM (SELECT ID
                                 , YEAR
                                 , MONTH
                                 , TO_DATE(TO_CHAR(YEAR) || '.' || TO_CHAR(MONTH) || '.' || '15', 'YYYY.MM.DD') the_date
                                 , VALUE
                                 , ABS(SIGN(VALUE -(NVL(LAG(VALUE) OVER(PARTITION BY ID ORDER BY YEAR
                                                       , MONTH), VALUE - 1)))) is_first
                                 , ABS(SIGN(VALUE -(NVL(LEAD(VALUE) OVER(PARTITION BY ID ORDER BY YEAR
                                                       , MONTH), VALUE - 1)))) is_last
                              FROM test_table)
                     WHERE is_first = 1
                        OR is_last = 1))
     WHERE is_first = 1
    
    0 讨论(0)
  • 2020-12-12 23:46

    This one uses only one table scan and works across years. It's better though to model your month and year column as only one date datatype column:

    SQL> create table tbl (id,month,year,value)
      2  as
      3  select 100,12,2007,80 from dual union all
      4  select 100,1,2008,80 from dual union all
      5  select 100,2,2008,80 from dual union all
      6  select 100,3,2008,90 from dual union all
      7  select 100,4,2008,80 from dual union all
      8  select 200,12,2007,50 from dual union all
      9  select 200,1,2008,50 from dual union all
     10  select 200,2,2008,40 from dual union all
     11  select 200,3,2008,50 from dual union all
     12  select 200,4,2008,50 from dual union all
     13  select 200,5,2008,50 from dual
     14  /
    
    Tabel is aangemaakt.
    
    SQL> select id
      2       , mod(min(year*12+month-1),12)+1 startmonth
      3       , trunc(min(year*12+month-1)/12) startyear
      4       , mod(max(year*12+month-1),12)+1 endmonth
      5       , trunc(max(year*12+month-1)/12) endyear
      6       , value
      7    from ( select id
      8                , month
      9                , year
     10                , value
     11                , max(rn) over (partition by id order by year,month) maxrn
     12             from ( select id
     13                         , month
     14                         , year
     15                         , value
     16                         , case lag(value) over (partition by id order by year,month)
     17                           when value then null
     18                           else rownum
     19                           end rn
     20                      from tbl
     21                  ) inner
     22         )
     23   group by id
     24       , maxrn
     25       , value
     26   order by id
     27       , startyear
     28       , startmonth
     29  /
    
            ID STARTMONTH  STARTYEAR   ENDMONTH    ENDYEAR      VALUE
    ---------- ---------- ---------- ---------- ---------- ----------
           100         12       2007          2       2008         80
           100          3       2008          3       2008         90
           100          4       2008          4       2008         80
           200         12       2007          1       2008         50
           200          2       2008          2       2008         40
           200          3       2008          5       2008         50
    
    6 rijen zijn geselecteerd.
    

    Regards, Rob.

    0 讨论(0)
  • 2020-12-12 23:54

    I'm going to develop my solution incrementally, decomposing each transformation into a view. This both helps explain what's being done, and helps in debugging and testing. It's essentially applying the principle of functional decomposition to database queries.

    I'm also going to do it without using Oracle extensions, with SQL that ought to run on any modern RBDMS. So no keep, over, partition, just subqueries and group bys. (Inform me in the comments if it doesn't work on your RDBMS.)

    First, the table, which since I'm uncreative, I'll call month_value. Since the id is not actually a unique id, I'll call it "eid". The other columns are "m"onth, "y"ear, and "v"alue:

    create table month_value( 
       eid int not null, m int, y int,  v int );
    

    After inserting the data, for two eids, I have:

    > select * from month_value;
    +-----+------+------+------+
    | eid | m    | y    | v    |
    +-----+------+------+------+
    | 100 |    1 | 2008 |   80 |
    | 100 |    2 | 2008 |   80 |
    | 100 |    3 | 2008 |   90 |
    | 100 |    4 | 2008 |   80 |
    | 200 |    1 | 2008 |   80 |
    | 200 |    2 | 2008 |   80 |
    | 200 |    3 | 2008 |   90 |
    | 200 |    4 | 2008 |   80 |
    +-----+------+------+------+
    8 rows in set (0.00 sec)
    

    Next, we have one entity, the month, that's represented as two variables. That should really be one column (either a date or a datetime, or maybe even a foreign key to a table of dates), so we'll make it one column. We'll do that as a linear transform, such that it sorts the same as (y, m), and such that for any (y,m) tuple there is one and only value, and all values are consecutive:

    > create view cm_abs_month as 
    select *, y * 12 + m as am from month_value;
    

    That gives us:

    > select * from cm_abs_month;
    +-----+------+------+------+-------+
    | eid | m    | y    | v    | am    |
    +-----+------+------+------+-------+
    | 100 |    1 | 2008 |   80 | 24097 |
    | 100 |    2 | 2008 |   80 | 24098 |
    | 100 |    3 | 2008 |   90 | 24099 |
    | 100 |    4 | 2008 |   80 | 24100 |
    | 200 |    1 | 2008 |   80 | 24097 |
    | 200 |    2 | 2008 |   80 | 24098 |
    | 200 |    3 | 2008 |   90 | 24099 |
    | 200 |    4 | 2008 |   80 | 24100 |
    +-----+------+------+------+-------+
    8 rows in set (0.00 sec)
    

    Now we'll use a self-join in a correlated subquery to find, for each row, the earliest successor month in which the value changes. We'll base this view on the previous view we created:

    > create view cm_last_am as 
       select a.*, 
        ( select min(b.am) from cm_abs_month b 
          where b.eid = a.eid and b.am > a.am and b.v <> a.v) 
       as last_am 
       from cm_abs_month a;
    
    > select * from cm_last_am;
    +-----+------+------+------+-------+---------+
    | eid | m    | y    | v    | am    | last_am |
    +-----+------+------+------+-------+---------+
    | 100 |    1 | 2008 |   80 | 24097 |   24099 |
    | 100 |    2 | 2008 |   80 | 24098 |   24099 |
    | 100 |    3 | 2008 |   90 | 24099 |   24100 |
    | 100 |    4 | 2008 |   80 | 24100 |    NULL |
    | 200 |    1 | 2008 |   80 | 24097 |   24099 |
    | 200 |    2 | 2008 |   80 | 24098 |   24099 |
    | 200 |    3 | 2008 |   90 | 24099 |   24100 |
    | 200 |    4 | 2008 |   80 | 24100 |    NULL |
    +-----+------+------+------+-------+---------+
    8 rows in set (0.01 sec)
    

    last_am is now the "absolute month" of the first (earliest) month (after the month of the current row) in which the value, v, changes. It's null where there is no later month, for that eid, in the table.

    Since last_am is the same for all months leading up to the change in v (which occurs at last_am), we can group on last_am and v (and eid, of course), and in any group, the min(am) is the absolute month of the first consecutive month that had that value:

    > create view cm_result_data as 
      select eid, min(am) as am , last_am, v 
      from cm_last_am group by eid, last_am, v;
    
    > select * from cm_result_data;
    +-----+-------+---------+------+
    | eid | am    | last_am | v    |
    +-----+-------+---------+------+
    | 100 | 24100 |    NULL |   80 |
    | 100 | 24097 |   24099 |   80 |
    | 100 | 24099 |   24100 |   90 |
    | 200 | 24100 |    NULL |   80 |
    | 200 | 24097 |   24099 |   80 |
    | 200 | 24099 |   24100 |   90 |
    +-----+-------+---------+------+
    6 rows in set (0.00 sec)
    

    Now this is the result set we want, which is why this view is called cm_result_data. All that's lacking is something to transform absolute months back to (y,m) tuples.

    To do that, we'll just join to the table month_value.

    There are only two problems: 1) we want the month before last_am in our output, and 2) we have nulls where there is no next month in our data; to met the OP's specification, those should be single month ranges.

    EDIT: These could actually be longer ranges than one month, but in every case they mean we need to find the latest month for the eid, which is:

    (select max(am) from cm_abs_month d where d.eid = a.eid )
    

    Because the views decompose the problem, we could add in this "end cap" month earlier, by adding another view, but I'll just insert this into the coalesce. Which would be most efficient depends on how your RDBMS optimizes queries.

    To get month before, we'll join (cm_result_data.last_am - 1 = cm_abs_month.am)

    Wherever we have a null, the OP wants the "to" month to be the same as the "from" month, so we'll just use coalesce on that: coalesce( last_am, am). Since last eliminates any nulls, our joins don't need to be outer joins.

    > select a.eid, b.m, b.y, c.m, c.y, a.v 
       from cm_result_data a 
        join cm_abs_month b 
          on ( a.eid = b.eid and a.am = b.am)  
        join cm_abs_month c 
          on ( a.eid = c.eid and 
          coalesce( a.last_am - 1, 
                  (select max(am) from cm_abs_month d where d.eid = a.eid )
          ) = c.am)
        order by 1, 3, 2, 5, 4;
    +-----+------+------+------+------+------+
    | eid | m    | y    | m    | y    | v    |
    +-----+------+------+------+------+------+
    | 100 |    1 | 2008 |    2 | 2008 |   80 |
    | 100 |    3 | 2008 |    3 | 2008 |   90 |
    | 100 |    4 | 2008 |    4 | 2008 |   80 |
    | 200 |    1 | 2008 |    2 | 2008 |   80 |
    | 200 |    3 | 2008 |    3 | 2008 |   90 |
    | 200 |    4 | 2008 |    4 | 2008 |   80 |
    +-----+------+------+------+------+------+
    

    By joining back we get the output the OP wants.

    Not that we have to join back. As it happens, our absolute_month function is bi-directional, so we can just recalculate the year and offset month from it.

    First, lets take care of adding the "end cap" month:

    > create or replace view cm_capped_result as 
    select eid, am, 
      coalesce( 
       last_am - 1, 
       (select max(b.am) from cm_abs_month b where b.eid = a.eid)
      ) as last_am, v  
     from cm_result_data a;
    

    And now we get the data, formatted per the OP:

    select eid, 
     ( (am - 1) % 12 ) + 1 as sm, 
     floor( ( am - 1 ) / 12 ) as sy, 
     ( (last_am - 1) % 12 ) + 1 as em, 
     floor( ( last_am - 1 ) / 12 ) as ey, v    
    from cm_capped_result 
    order by 1, 3, 2, 5, 4;
    
    +-----+------+------+------+------+------+
    | eid | sm   | sy   | em   | ey   | v    |
    +-----+------+------+------+------+------+
    | 100 |    1 | 2008 |    2 | 2008 |   80 |
    | 100 |    3 | 2008 |    3 | 2008 |   90 |
    | 100 |    4 | 2008 |    4 | 2008 |   80 |
    | 200 |    1 | 2008 |    2 | 2008 |   80 |
    | 200 |    3 | 2008 |    3 | 2008 |   90 |
    | 200 |    4 | 2008 |    4 | 2008 |   80 |
    +-----+------+------+------+------+------+
    

    And there's the data the OP wants. All in SQL that should run on any RDBMS, and is decomposed into simple, easy to understand and easy to test views.

    Is is better to rejoin or to recalculate? I'll leave that (it's a trick question) to the reader.

    (If your RDBMS doesn't allow group bys in views, you'll have to join first and then group, or group and then pull in the month and year with correlated subqueries. This is left as an exercise for the reader.)


    Jonathan Leffler asks in the comments,

    What happens with your query if there are gaps in the data (say there's an entry for 2007-12 with value 80, and another for 2007-10, but not one for 2007-11? The question isn't clear what should happen there.

    Well, you're exactly right, the OP doesn't specify. Perhaps there's an (unmentioned) pre-condition that there are no gaps. In the absence of a requirement, we shouldn't try to code around something that might not be there. But, the fact is, gaps make the "joining back" strategy fail; the "recalculate" strategy doesn't fail under those conditions. I'd say more, but that would reveal the trick in the trick question I alluded to above.

    0 讨论(0)
提交回复
热议问题