问题
I have a table with the following columns and data. Data describes certain customer activity periods
cust_id s_date e_date
11111 01.03.2014 31.03.2014
11111 10.04.2014 30.04.2014
11111 01.05.2014 10.05.2014
11111 15.06.2014 31.07.2014
22222 01.04.2014 31.05.2014
22222 01.06.2014 30.06.2014
22222 01.07.2014 15.07.2014
And I want to write a query which gives this result:
cust_id s_date e_date
11111 01.03.2014 10.05.2014
11111 15.06.2014 31.07.2014
22222 01.04.2014 15.07.2014
The query result purpose is to "merge" rows into one row when customer IN-activity period is less than 15 days. I can handle with "1 row preceding" but if needed to merge 3 or more rows then it does not work. I run out of ideas how to write this query.
My "half" 1 row preceding query:
SELECT cust_id
, start_date as current_period_start_date
, end_date as current_period_end_date
, end_date+15 as current_period_expired_date
, coalesce(
min(current_period_expire_date)
over(partition by cust_id
order by start_date
rows between 1 preceding and 1 preceding)
, cast('1900-01-01' as date)) as previous_period_expire_date
, case
when current_period_start_date <= previous_period_expire_date
then min(current_period_start_date)
over(partition by cust_id
order by start_date
rows between 1 preceding and current row)
else current_period_start_date
end as new_current_period_start_date
FROM MY_DB.my_table
. . .
Also, is it possible to change preceding into dynamical way like this?
... over(partition by ... order by ... rows between X preceding and current row)
回答1:
Gordon's answer can be modified as the basic LAG syntax is easily rewritten:
LAG(col, n) OVER (ORDER BY c)
is the same as a
MIN(col) OVER (ORDER BY c ROWS BETWEEN n PRECEDING AND n PRECEDING)
The possible default value as a 3rd parameter can be done using a COALESCE(LAG...., default value), only the IGNORE NULLS option is a really tough one.
This results in:
SELECT cust_id, MIN(s_date) AS s_date, MAX(e_date) AS e_date
FROM (SELECT t.*, SUM(GroupStartFlag) OVER (PARTITION BY cust_id ORDER BY s_date ROWS UNBOUNDED PRECEDING) AS grpid
FROM (SELECT cust_id, s_date, e_date,
(CASE WHEN s_date <= MIN(e_date)
OVER (PARTITION BY cust_id
ORDER BY s_date
ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) + 15
THEN 0
ELSE 1
END) AS GroupStartFlag
FROM vt
) t
) t
GROUP BY cust_id, grpid;
If you don't need any additional columns (just the cust_id and the dates) you can also utilize a specific TD 13.10 table function for normalizing periods. To include the 15 day difference you might simply subtract/add 15 days:
WITH cte (cust_id, pd)
AS
( SELECT cust_id, PERIOD(s_date-15, e_date) AS pd
FROM vt
)
SELECT cust_id,
BEGIN(pd)+15,
END(pd),
cnt
FROM TABLE (TD_NORMALIZE_OVERLAP_MEET
(NEW VARIANT_TYPE(cte.cust_id)
,cte.pd)
RETURNS (cust_id INTEGER
,pd PERIOD(DATE)
,cnt INTEGER) --optional: number of rows normalized in one result row
HASH BY cust_id
LOCAL ORDER BY cust_id, pd
) AS t;
In TD 14.10 there's also a real nice syntax for normalizing periods:
SELECT cust_id, BEGIN (pd)+15, END(pd)
FROM
(
SELECT NORMALIZE
cust_id, PERIOD(s_date-15, e_date) AS pd
FROM vt
) AS dt
Btw, periods are defined with an inclusive start but exclusive end (i.e. for a gapless period end of the previous period and start of the next got the same value), so you might have to change 15 to 16 to get the desired result.
回答2:
I would approach this using the lag()
function. This function can be used to identify each row that begins a new period. Then when this flag is cumulatively summed, it provides a group identifier. Here is what the code looks like:
select cust_id, min(s_date) as s_date, max(e_date) as e_date
from (select t.*, sum(GroupStartFlag) over (partition by cust_id order by s_date rows unbounded preceding) as grpid
from (select cust_id, s_date, e_date,
(case when s_date <= lag(e_date) over (partition by cust_id order by s_date) + 15
then 0
else 1
end) as GroupStartFlag
from MY_DB.my_table
) t
) t
group by cust_id, grpid;
Note: Teradata supports window functions but sometimes has odd requirements on them. I think the above will work directly, but I have no system to test it on.
EDIT:
I'm not sure if Teradata supports the lag()
function. You can do the equivalent with a correlated subquery:
select cust_id, min(s_date) as s_date, max(e_date) as e_date
from (select t.*,
sum(case when s_date <= prev_edate + 15 then 0 else 1 end) over
(partition by cust_id order by s_date rows unbounded preceding) as grpid
from (select cust_id, s_date, e_date,
(select max(e_date)
from MY_DB.my_table t2
where t2.cust_id = t.cust_id and
t2.s_date < t.s_date
) as prev_edate
from MY_DB.my_table t
) t
) t
group by cust_id, grpid;
来源:https://stackoverflow.com/questions/25167738/teradata-partitioned-query-following-rows-dynamically