Implementing Type 2 SCD in Oracle

Question

I am new to the Stack Overflow community and relatively new to SQL, so please pardon me if I didn't format my question correctly or state my requirements clearly.

I am trying to implement a type 2 SCD in Oracle. The structure of the source table (customer_records) is given below.

CREATE TABLE customer_records(
    day date,
    snapshot_day number,
    vendor_id number,
    customer_id number,
    rank number
);

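-- Older Oracle releases do not accept a multi-row VALUES list, so the rows are
-- supplied through SELECT ... FROM dual. Dates are written as ANSI DATE literals;
-- an unquoted 9/24/2014 would be evaluated as arithmetic.
INSERT INTO customer_records (day, snapshot_day, vendor_id, customer_id, rank)
SELECT DATE '2014-09-24', 6266, 71047795, 476095, 3103 FROM dual UNION ALL
SELECT DATE '2014-10-01', 6273, 71047795, 476095, 3103 FROM dual UNION ALL
SELECT DATE '2014-10-08', 6280, 71047795, 476095, 3103 FROM dual UNION ALL
SELECT DATE '2014-10-15', 6287, 71047795, 476095, 3103 FROM dual UNION ALL
SELECT DATE '2014-10-22', 6291, 71047795, 476095, 3102 FROM dual UNION ALL
SELECT DATE '2014-10-29', 6330, 71047795, 476095, 3102 FROM dual UNION ALL
SELECT DATE '2015-11-05', 6351, 71047795, 476095, 3102 FROM dual UNION ALL
SELECT DATE '2015-11-12', 6440, 71047795, 476095, 3103 FROM dual;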

The table above is updated weekly. I have pulled the records for one particular customer, identified by vendor_id and customer_id; each customer has a unique combination of vendor_id and customer_id. I am trying to track changes in a customer's tier (rank). A customer's tier may stay the same for several weeks, and we only want to record the points at which it changes.

The desired output (dimension table) would look something like this:

SK  Version  Date_From   Date_To     Vendor_id  Customer_Id  Rank_Id

1   1        9/24/2014   10/22/2014  71047795   476095       3103
2   2        10/22/2014  11/05/2015  71047795   476095       3102
3   3        11/05/2015  12/31/2199  71047795   476095       3103

That is, whenever a customer's tier changes, we record the change in a new table. I would also like to include current_flag = 'Y' for the most recent tier.

I want to be able to do it using merge.

Answer 1:

Here is an approach to group consecutive records having the same tier, while detecting changes.

The idea is to self-join the table, and to relate each record to the next record that has a different tier. This is done using a NOT EXISTS condition with a correlated subquery.

A LEFT JOIN is needed to avoid filtering out the last record (the one that holds the current tier), which does not have a next record yet. For this record, COALESCE() supplies a default end date.

SELECT 
    c1.day day_from,
    COALESCE(c2.day, TO_DATE('2199-12-31', 'yyyy-mm-dd')) day_to,
    c1.Vendor_ID,
    c1.Customer_ID, 
    c1.rank
FROM customer_records c1
-- c2 is the first later record of the same customer with a different tier
LEFT JOIN customer_records c2 
    ON  c2.Vendor_ID   = c1.Vendor_ID
    AND c2.Customer_ID = c1.Customer_ID
    AND c2.rank       <> c1.rank
    AND c2.DAY         > c1.DAY
    -- "first" means no other tier change lies between c1 and c2
    AND NOT EXISTS (
        SELECT 1
        FROM customer_records c3
        WHERE
                c3.Vendor_ID   = c1.Vendor_ID
            AND c3.Customer_ID = c1.Customer_ID
            AND c3.rank       <> c1.rank
            AND c3.DAY         > c1.DAY
            AND c3.DAY         < c2.DAY
    )
-- without an ORDER BY the row order shown below is not guaranteed
ORDER BY
    c1.day

This returns:

 DAY_FROM  | DAY_TO    | Vendor_ID | Customer_ID | rank
 :-------- | :-------- | --------: | ----------: | ---:
 24-SEP-14 | 22-OCT-14 |  71047795 |      476095 | 3103
 01-OCT-14 | 22-OCT-14 |  71047795 |      476095 | 3103
 08-OCT-14 | 22-OCT-14 |  71047795 |      476095 | 3103
 15-OCT-14 | 22-OCT-14 |  71047795 |      476095 | 3103
 22-OCT-14 | 12-NOV-15 |  71047795 |      476095 | 3102
 29-OCT-14 | 12-NOV-15 |  71047795 |      476095 | 3102
 05-NOV-15 | 12-NOV-15 |  71047795 |      476095 | 3102
 12-NOV-15 | 31-DEC-99 |  71047795 |      476095 | 3103

Now we can group the record set by tier and end date to generate the expected results. ROW_NUMBER() can give you the version number. It is also easy to check which record is the current one, as explained above.

SELECT 
    -- number the versions per customer, oldest tier first
    ROW_NUMBER() OVER(
        PARTITION BY c1.Vendor_ID, c1.Customer_ID
        ORDER BY MIN(c1.day)
    ) version,
    -- no later record with a different tier means this is the current one
    DECODE(c2.day, NULL, 'Y', 'N') current_flag,
    MIN(c1.day) day_from,
    COALESCE(c2.day, TO_DATE('2199-12-31', 'yyyy-mm-dd')) day_to,
    c1.Vendor_ID,
    c1.Customer_ID, 
    c1.rank
FROM customer_records c1
LEFT JOIN customer_records c2 
    ON  c2.Vendor_ID   = c1.Vendor_ID
    AND c2.Customer_ID = c1.Customer_ID
    AND c2.rank       <> c1.rank
    AND c2.DAY         > c1.DAY
    AND NOT EXISTS (
        SELECT 1
        FROM customer_records c3
        WHERE
                c3.Vendor_ID   = c1.Vendor_ID
            AND c3.Customer_ID = c1.Customer_ID
            AND c3.rank       <> c1.rank
            AND c3.DAY         > c1.DAY
            AND c3.DAY         < c2.DAY
    )
GROUP BY
    c1.Vendor_ID, 
    c1.Customer_ID, 
    c1.rank, 
    c2.day
ORDER BY
    day_from

Results:

VERSION | CURRENT_FLAG | DAY_FROM  | DAY_TO    | Vendor_ID | Customer_ID | rank
------: | :----------- | :-------- | :-------- | --------: | ----------: | ---:
      1 | N            | 24-SEP-14 | 22-OCT-14 |  71047795 |      476095 | 3103
      2 | N            | 22-OCT-14 | 12-NOV-15 |  71047795 |      476095 | 3102
      3 | Y            | 12-NOV-15 | 31-DEC-99 |  71047795 |      476095 | 3103
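The same three rows can probably also be produced without a self-join. The following is a minimal sketch of a different technique from the one above: it uses Oracle's analytic functions LAG and LEAD to keep only the rows where the tier changes. It assumes rank is never NULL; the CTE names tier_changes and tier_starts are made up for illustration.

```sql
WITH tier_changes AS (
    -- pair every record with the tier of the same customer's previous record
    SELECT day, vendor_id, customer_id, rank,
           LAG(rank) OVER (PARTITION BY vendor_id, customer_id
                           ORDER BY day) AS prev_rank
    FROM customer_records
),
tier_starts AS (
    -- keep the first record of each customer and every record whose tier differs
    -- from the previous one (assumes rank is never NULL)
    SELECT day, vendor_id, customer_id, rank
    FROM tier_changes
    WHERE prev_rank IS NULL
       OR prev_rank <> rank
)
SELECT
    ROW_NUMBER() OVER (PARTITION BY vendor_id, customer_id
                       ORDER BY day) AS version,
    -- the latest tier start has no following start, so it is the current row
    CASE
        WHEN LEAD(day) OVER (PARTITION BY vendor_id, customer_id
                             ORDER BY day) IS NULL THEN 'Y'
        ELSE 'N'
    END AS current_flag,
    day AS day_from,
    -- each tier ends where the next one starts; the last one gets the default
    LEAD(day, 1, DATE '2199-12-31')
        OVER (PARTITION BY vendor_id, customer_id ORDER BY day) AS day_to,
    vendor_id,
    customer_id,
    rank
FROM tier_starts
ORDER BY vendor_id, customer_id, day_from;
```

This form reads customer_records once instead of three times, which may matter on a large table, but the self-join version above works just as well for small volumes.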

In Oracle, a SELECT like this one can serve as the source (the USING clause) of a MERGE statement. You can match on the key columns (day_from, vendor_id, customer_id and rank), update current_flag and day_to if a record already exists, and otherwise insert a new record.

MERGE INTO dimensions dim
USING (
   -- above query goes here --
) cust 
-- Oracle requires parentheses around the ON condition of a MERGE
ON (
        dim.DAY_FROM    = cust.DAY_FROM
    AND dim.vendor_id   = cust.vendor_id
    AND dim.Customer_ID = cust.Customer_ID
    AND dim.rank        = cust.rank
)
WHEN MATCHED THEN UPDATE SET 
    dim.DAY_TO       = cust.DAY_TO,
    dim.CURRENT_FLAG = cust.CURRENT_FLAG
WHEN NOT MATCHED THEN 
    -- each target column is listed once (DAY_FROM appeared twice before,
    -- which Oracle rejects as a duplicate column name)
    INSERT (
        VERSION, 
        CURRENT_FLAG, 
        DAY_FROM, 
        DAY_TO, 
        vendor_id, 
        customer_id, 
        rank
    ) VALUES (
        cust.VERSION, 
        cust.CURRENT_FLAG, 
        cust.DAY_FROM, 
        cust.DAY_TO, 
        cust.vendor_id, 
        cust.Customer_ID, 
        cust.rank
    )
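The MERGE relies on a target table called dimensions, which is never defined in the question. A minimal sketch of what it might look like is shown here; the data types, the constraints and the identity-based sk column (which needs Oracle 12c or later) are all assumptions, not something taken from the original post.

```sql
-- Hypothetical definition of the dimension table targeted by the MERGE.
CREATE TABLE dimensions (
    -- surrogate key, filled in automatically (assumes Oracle 12c or later)
    sk           NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    version      NUMBER  NOT NULL,
    current_flag CHAR(1) NOT NULL CHECK (current_flag IN ('Y', 'N')),
    day_from     DATE    NOT NULL,
    day_to       DATE    NOT NULL,
    vendor_id    NUMBER  NOT NULL,
    customer_id  NUMBER  NOT NULL,
    rank         NUMBER  NOT NULL
);
```

Because sk is generated by the database, the INSERT branch of the MERGE does not need to supply it.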

Answer 2:

"I want to be able to do it using merge."

MERGE won't do it for you. MERGE is basically a case statement: for each record in the USING subquery we can insert unmatched records or update matched records. The catch is that when an existing customer's tier changes, you need to execute DML for two dimension records:

1. Update the previous current record: set current_flag = 'N' and set day_to to systimestamp (or whatever).
2. Insert the new current record.

So you need to have a process - probably a PL/SQL procedure - which executes an UPDATE statement to close off the expired current records followed by an INSERT to add the new current records.
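As a rough sketch of that two-step process, the statements below close the expired rows and then add the new current rows. Several things here are assumptions: the dimension table is named dimensions and has the columns used in the first answer, :last_snapshot_day is a hypothetical bind variable holding the highest snapshot_day already applied, and each run brings at most one new source row per customer.

```sql
-- Step 1: close the current row of every customer whose tier has changed.
-- The scalar subquery relies on the assumption of one new row per customer.
UPDATE dimensions dim
SET    dim.current_flag = 'N',
       dim.day_to = (SELECT src.day
                     FROM   customer_records src
                     WHERE  src.vendor_id    = dim.vendor_id
                     AND    src.customer_id  = dim.customer_id
                     AND    src.snapshot_day > :last_snapshot_day
                     AND    src.rank        <> dim.rank)
WHERE  dim.current_flag = 'Y'
AND    EXISTS (SELECT 1
               FROM   customer_records src
               WHERE  src.vendor_id    = dim.vendor_id
               AND    src.customer_id  = dim.customer_id
               AND    src.snapshot_day > :last_snapshot_day
               AND    src.rank        <> dim.rank);

-- Step 2: add a current row for every customer that no longer has one,
-- i.e. brand-new customers and the customers closed off in step 1.
INSERT INTO dimensions
    (version, current_flag, day_from, day_to, vendor_id, customer_id, rank)
SELECT (SELECT NVL(MAX(d.version), 0) + 1
        FROM   dimensions d
        WHERE  d.vendor_id   = src.vendor_id
        AND    d.customer_id = src.customer_id),
       'Y',
       src.day,
       DATE '2199-12-31',
       src.vendor_id,
       src.customer_id,
       src.rank
FROM   customer_records src
WHERE  src.snapshot_day > :last_snapshot_day
AND    NOT EXISTS (SELECT 1
                   FROM   dimensions d
                   WHERE  d.vendor_id    = src.vendor_id
                   AND    d.customer_id  = src.customer_id
                   AND    d.current_flag = 'Y');

COMMIT;
```

Customers whose tier is unchanged still have a current row after step 1, so step 2 skips them. If a run can contain several new rows per customer, the source rows would first have to be reduced to the actual tier changes.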

"Sub-query may not be the best route I believe."

You describe yourself as relatively new to SQL, so you might worry about this, but don't. Avoid premature optimization. Do the simplest thing which could work and tune it as required. A subquery should be the most efficient way to identify the current records you need to update. Oracle databases are workhorses and can handle substantial loads, provided we write sensible SQL.

In your case that means:

- Use set operations (i.e. not row-by-row) for both UPDATE and INSERT.
- Make sure you work with the smallest set of records necessary. Only apply changes for records in the base table which have changed since the last time you refreshed the dimension. In your case you need to track customer_records.snapshot_day and only apply changes for the records which have a higher snapshot_day (or maybe not, I'm guessing at your process).
- Index your dimension table properly so it applies the subquery efficiently.
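To illustrate the indexing point, here is one possible index, assuming the dimension table is named dimensions (as in the first answer) and that the current row is looked up by customer key and flag. The index name is made up, and whether this is the right column order depends on your actual queries.

```sql
-- Hypothetical index supporting "find the current row for this customer".
CREATE INDEX dimensions_current_ix
    ON dimensions (vendor_id, customer_id, current_flag);
```

Checking the execution plan of the UPDATE before and after creating the index is the safest way to confirm that it is actually used.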

Source: https://stackoverflow.com/questions/54726665/implementing-type-2-scd-in-oracle

Tags: sql, Oracle, etl, bulk-operations