Question
I am new to the Stack Overflow community and relatively new to SQL, so please pardon me if I didn't format my question correctly or state my requirements clearly.
I am trying to implement a type 2 SCD (slowly changing dimension) in Oracle. The structure of the source table (customer_records) is given below.
CREATE TABLE customer_records(
day date,
snapshot_day number,
vendor_id number,
customer_id number,
rank number
);
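-- Dates are written as ANSI date literals; one SELECT per row is used
-- because older Oracle releases do not accept a multi-row VALUES list.
INSERT INTO customer_records
(day,snapshot_day,vendor_id,customer_id,rank)
SELECT DATE '2014-09-24',6266,71047795,476095,3103 FROM dual UNION ALL
SELECT DATE '2014-10-01',6273,71047795,476095,3103 FROM dual UNION ALL
SELECT DATE '2014-10-08',6280,71047795,476095,3103 FROM dual UNION ALL
SELECT DATE '2014-10-15',6287,71047795,476095,3103 FROM dual UNION ALL
SELECT DATE '2014-10-22',6291,71047795,476095,3102 FROM dual UNION ALL
SELECT DATE '2014-10-29',6330,71047795,476095,3102 FROM dual UNION ALL
SELECT DATE '2015-11-05',6351,71047795,476095,3102 FROM dual UNION ALL
SELECT DATE '2015-11-12',6440,71047795,476095,3103 FROM dual;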
The table is updated weekly. The rows above are the records for one particular customer, identified by vendor_id and customer_id; each customer has a unique combination of vendor_id and customer_id. I am trying to track changes in a customer's tier (rank). A customer's tier may stay the same for several weeks, and we only want to record a row when the tier changes.
The desired output (dimension table) would look something like this:
SK Version Date_From Date_To Vendor_id Customer_Id Rank_Id
1 1 9/24/2014 10/22/2014 71047795 476095 3103
2 2 10/22/2014 11/05/2015 71047795 476095 3102
3 3 11/05/2015 12/31/2199 71047795 476095 3103
That is, whenever a customer's tier changes, we record the change in a new table. I would also like to include current_flag = 'Y' for the most recent tier.
I want to be able to do it using merge.
Answer 1:
Here is an approach to group consecutive records having the same tier, while detecting changes.
The idea is to self-join the table and relate each record to the next record that has a different tier. This is done using a NOT EXISTS condition with a correlated subquery, which rules out any earlier record with a different tier between the two.
A LEFT JOIN is needed to avoid filtering out the last record (the one that owns the current tier), which does not have a next record yet: for this record, we use COALESCE() to set up a default end date.
SELECT
c1.day day_from,
COALESCE(c2.day, TO_DATE('2199-12-31', 'yyyy-mm-dd')) day_to,
c1.Vendor_ID,
c1.Customer_ID,
c1.rank
FROM customer_records c1
LEFT JOIN customer_records c2
ON c2.Vendor_ID = c1.Vendor_ID
AND c2.Customer_ID = c1.Customer_ID
AND c2.rank <> c1.rank
AND c2.DAY > c1.DAY
AND NOT EXISTS (
SELECT 1
FROM customer_records c3
WHERE
c3.Vendor_ID = c1.Vendor_ID
AND c3.Customer_ID = c1.Customer_ID
AND c3.rank <> c1.rank
AND c3.DAY > c1.DAY
AND c3.DAY < c2.DAY
)
This returns:
DAY_FROM | DAY_TO | Vendor_ID | Customer_ID | rank
:-------- | :-------- | ------------------: | ----------: | -----------------:
24-SEP-14 | 22-OCT-14 | 71047795 | 476095 | 3103
01-OCT-14 | 22-OCT-14 | 71047795 | 476095 | 3103
08-OCT-14 | 22-OCT-14 | 71047795 | 476095 | 3103
15-OCT-14 | 22-OCT-14 | 71047795 | 476095 | 3103
22-OCT-14 | 12-NOV-15 | 71047795 | 476095 | 3102
29-OCT-14 | 12-NOV-15 | 71047795 | 476095 | 3102
05-NOV-15 | 12-NOV-15 | 71047795 | 476095 | 3102
12-NOV-15 | 31-DEC-99 | 71047795 | 476095 | 3103
Now we can group the record set by tier and end date to generate the expected results. ROW_NUMBER() can give you the version number. It is also easy to tell which record is the current one: as explained above, it is the one with no next record, so c2.day is NULL.
SELECT
ROW_NUMBER() OVER(PARTITION BY c1.Vendor_ID, c1.Customer_ID ORDER BY c2.day) version,
DECODE(c2.day, NULL, 'Y', 'N') current_flag,
MIN(c1.day) day_from,
COALESCE(c2.day, TO_DATE('2199-12-31', 'yyyy-mm-dd')) day_to,
c1.Vendor_ID,
c1.Customer_ID,
c1.rank
FROM customer_records c1
LEFT JOIN customer_records c2
ON c2.Vendor_ID = c1.Vendor_ID
AND c2.Customer_ID = c1.Customer_ID
AND c2.rank <> c1.rank
AND c2.DAY > c1.DAY
AND NOT EXISTS (
SELECT 1
FROM customer_records c3
WHERE
c3.Vendor_Id = c1.Vendor_Id
AND c3.Customer_ID = c1.Customer_ID
AND c3.rank <> c1.rank
AND c3.DAY > c1.DAY
AND c3.DAY < c2.DAY
)
GROUP BY
c1.Vendor_Id,
c1.Customer_ID,
c1.rank,
c2.day
ORDER BY
day_from
Results:
VERSION | CURRENT_FLAG | DAY_FROM | DAY_TO | Vendor_ID | Customer_ID | rank
------: | :----------- | :-------- | :-------- | ------------------: | ----------: | -----------------:
1 | N | 24-SEP-14 | 22-OCT-14 | 71047795 | 476095 | 3103
2 | N | 22-OCT-14 | 12-NOV-15 | 71047795 | 476095 | 3102
3 | Y | 12-NOV-15 | 31-DEC-99 | 71047795 | 476095 | 3103
In Oracle, you can turn any SELECT into a merge query using the MERGE syntax. You can match on all columns except current_flag and day_to, and update these two if a record already exists; otherwise, just insert a new one.
MERGE INTO dimensions dim
USING (
-- above query goes here --
) cust
-- Oracle expects the MERGE ON condition to be enclosed in parentheses
ON (
dim.DAY_FROM = cust.DAY_FROM
AND dim.vendor_id = cust.vendor_id
AND dim.Customer_ID = cust.Customer_ID
AND dim.rank = cust.rank
)
WHEN MATCHED THEN UPDATE SET
dim.DAY_TO = cust.DAY_TO,
dim.CURRENT_FLAG = cust.CURRENT_FLAG
WHEN NOT MATCHED THEN
-- each target column is listed once (DAY_FROM was duplicated)
INSERT (
VERSION,
CURRENT_FLAG,
DAY_FROM,
DAY_TO,
vendor_id,
customer_id,
rank
) VALUES (
cust.VERSION,
cust.CURRENT_FLAG,
cust.DAY_FROM,
cust.DAY_TO,
cust.vendor_id,
cust.Customer_ID,
cust.rank
)
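The MERGE above targets a table named dimensions whose definition is not shown. A minimal sketch of what it could look like follows; the column types, the sk identity column (which needs Oracle 12c or later), and the constraint name are assumptions, not taken from the answer.

```sql
-- Hypothetical target table for the MERGE; all types and constraints are assumptions.
CREATE TABLE dimensions (
    -- surrogate key, filled in automatically because the MERGE does not insert it
    sk            NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    version       NUMBER  NOT NULL,
    current_flag  CHAR(1) CHECK (current_flag IN ('Y', 'N')),
    day_from      DATE    NOT NULL,
    day_to        DATE    NOT NULL,
    vendor_id     NUMBER  NOT NULL,
    customer_id   NUMBER  NOT NULL,
    rank          NUMBER  NOT NULL,
    -- a customer has one tier per start date, in line with the MERGE's ON condition
    CONSTRAINT dimensions_uk UNIQUE (vendor_id, customer_id, day_from)
);
```

Leaving sk out of the INSERT column list lets the identity column supply the SK values the question asks for.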
Answer 2:
I want to be able to do it using merge.
MERGE won't do it for you. MERGE is basically a case statement: for each record in the USING subquery, we can either insert an unmatched record or update a matched record. The catch is that when an existing customer's tier changes, you need to execute DML for two dimension records:
- update the previous current record: set current_flag = 'N' and set day_to to systimestamp (or whatever).
- insert the new current record.
So you need to have a process - probably a PL/SQL procedure - which executes an UPDATE statement to close off the expired current records followed by an INSERT to add the new current records.
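A rough sketch of such a procedure is shown below. It is one possible reading of the description, not a definitive implementation: the procedure name refresh_customer_dim is hypothetical, the dimensions table and its columns are borrowed from the MERGE in the first answer, and it assumes the procedure runs after every weekly load, so only the latest snapshot of each customer is compared with the current dimension row. It also closes a record with the day of the tier change rather than systimestamp, to match the Date_To values in the question.

```sql
-- Hypothetical procedure; table and column names follow the first answer's MERGE.
CREATE OR REPLACE PROCEDURE refresh_customer_dim AS
BEGIN
    -- Step 1: close the current row of every customer whose latest rank differs from it.
    UPDATE dimensions d
       SET d.current_flag = 'N',
           d.day_to = (SELECT MAX(c.day)
                         FROM customer_records c
                        WHERE c.vendor_id = d.vendor_id
                          AND c.customer_id = d.customer_id)
     WHERE d.current_flag = 'Y'
       AND EXISTS (SELECT 1
                     FROM customer_records c
                    WHERE c.vendor_id = d.vendor_id
                      AND c.customer_id = d.customer_id
                      AND c.rank <> d.rank
                      AND c.day = (SELECT MAX(c2.day)
                                     FROM customer_records c2
                                    WHERE c2.vendor_id = c.vendor_id
                                      AND c2.customer_id = c.customer_id));

    -- Step 2: open a new current row for every customer that has no current row,
    -- i.e. customers closed in step 1 and customers seen for the first time.
    INSERT INTO dimensions
        (version, current_flag, day_from, day_to, vendor_id, customer_id, rank)
    SELECT NVL((SELECT MAX(d.version)
                  FROM dimensions d
                 WHERE d.vendor_id = c.vendor_id
                   AND d.customer_id = c.customer_id), 0) + 1,
           'Y',
           c.day,
           DATE '2199-12-31',
           c.vendor_id,
           c.customer_id,
           c.rank
      FROM customer_records c
     WHERE c.day = (SELECT MAX(c2.day)
                      FROM customer_records c2
                     WHERE c2.vendor_id = c.vendor_id
                       AND c2.customer_id = c.customer_id)
       AND NOT EXISTS (SELECT 1
                         FROM dimensions d
                        WHERE d.vendor_id = c.vendor_id
                          AND d.customer_id = c.customer_id
                          AND d.current_flag = 'Y');
    -- Transaction control (COMMIT or ROLLBACK) is left to the caller.
END refresh_customer_dim;
/
```

Both statements are set-based, and running the UPDATE first is what lets the INSERT recognise changed customers by the absence of a current row. If the dimension has a surrogate key column, this sketch assumes it is populated automatically (for example by an identity column).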
Sub-query may not be the best route I believe.
You describe yourself as relatively new to SQL, so you might worry about this, but don't. Avoid premature optimization. Do the simplest thing which could work and tune it as required. A subquery should be the most efficient way to identify the current records you need to update. Oracle databases are workhorses and can handle substantial loads, provided we write sensible SQL.
In your case that means:
- use set operations (i.e. not row-by-row) for both UPDATE and INSERT.
- make sure you work with the smallest set of records necessary. Only apply changes for records in the base table which have changed since the last time you refreshed the dimension. In your case you need to track customer_records.snapshot_day and only apply changes for the records which have a higher snapshot_day (or maybe not, I'm guessing at your process).
- index your dimension table properly so it applies the subquery efficiently.
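As a small illustration of the last two points, the sketch below adds an index for looking up a customer's current dimension row and restricts the source rows to new snapshots. The index name and the :last_snapshot_day bind variable are hypothetical, and the dimensions table is the one used in the first answer; whether this index is the right one depends on the actual workload.

```sql
-- Supports lookups of a customer's current row in the dimension table.
CREATE INDEX dimensions_current_ix
    ON dimensions (vendor_id, customer_id, current_flag);

-- Only pick up snapshots loaded since the previous refresh.
-- :last_snapshot_day is a hypothetical bind variable holding the highest
-- snapshot_day that has already been applied to the dimension.
SELECT c.vendor_id, c.customer_id, c.day, c.rank
  FROM customer_records c
 WHERE c.snapshot_day > :last_snapshot_day;
```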
Source: https://stackoverflow.com/questions/54726665/implementing-type-2-scd-in-oracle