Let me first say: being able to take 17 million records from a flat file, push them to a database on a remote box, and have it take 7 minutes is amazing. SSIS truly is fantastic.
We can use lookups for this. SSIS provides two Data Flow Transformations for fuzzy matching: Fuzzy Grouping and Fuzzy Lookup.
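Those two components are configured in the SSIS designer rather than in code, but if you want a rough feel for what approximate matching looks like in plain T-SQL, SOUNDEX and DIFFERENCE give a crude approximation (far less sophisticated than the SSIS components; the staging_contacts table and last_name column here are hypothetical):

SELECT a.row_identifier
     , b.row_identifier AS possible_duplicate_of
     , DIFFERENCE(a.last_name, b.last_name) AS similarity  -- 0 (no match) to 4 (strong match)
FROM staging_contacts AS a
JOIN staging_contacts AS b
    ON a.row_identifier < b.row_identifier                 -- evaluate each pair only once
WHERE DIFFERENCE(a.last_name, b.last_name) >= 3;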
I would recommend loading a staging table on the destination server and then merging the results into a target table there. If you need to run any hygiene rules, do it in a stored procedure, since you are bound to get better performance than through SSIS Data Flow Transformation tasks. Besides, deduping is generally a multi-step process: you may want to dedupe on exact duplicate rows first, for example, and then on a business key such as email address. The query below shows the key-based step, keeping only the most recent row per email address.
WITH sample_records
( email_address
, entry_date
, row_identifier
)
AS
(
    -- Five sample rows: three duplicates for tester@test.com,
    -- two for the_other_test@test.com
    SELECT 'tester@test.com', '2009-10-08 10:00:00', 1
    UNION ALL SELECT 'tester@test.com', '2009-10-08 10:00:01', 2
    UNION ALL SELECT 'tester@test.com', '2009-10-08 10:00:02', 3
    UNION ALL SELECT 'the_other_test@test.com', '2009-10-08 10:00:00', 4
    UNION ALL SELECT 'the_other_test@test.com', '2009-10-08 10:00:00', 5
)
, filter_records
( email_address
, entry_date
, row_identifier
, sequential_order
, reverse_order
)
AS
(
    -- Number each row within its email_address group, ascending and
    -- descending, so we can keep either the first or the last row
    SELECT email_address
        , entry_date
        , row_identifier
        , ROW_NUMBER() OVER (PARTITION BY email_address
                             ORDER BY row_identifier ASC)  AS sequential_order
        , ROW_NUMBER() OVER (PARTITION BY email_address
                             ORDER BY row_identifier DESC) AS reverse_order
    FROM sample_records
)
-- reverse_order = 1 keeps the most recent row per email address;
-- filter on sequential_order = 1 instead to keep the earliest
SELECT email_address
    , entry_date
    , row_identifier
FROM filter_records
WHERE reverse_order = 1
ORDER BY email_address;
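Applied to a real staging table, the same windowing pattern can delete the duplicates in place, since SQL Server lets you DELETE through a CTE that references a single table. A minimal sketch, assuming a staging table named staging_contacts with the same three columns:

WITH ranked AS
(
    SELECT row_identifier
        , ROW_NUMBER() OVER (PARTITION BY email_address
                             ORDER BY row_identifier DESC) AS reverse_order
    FROM staging_contacts
)
DELETE FROM ranked
WHERE reverse_order > 1;  -- removes everything but the most recent row per email address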
There are lots of options for deduping files, but ultimately I recommend handling it in a stored procedure once you have loaded the staging table on the destination server. After you cleanse the data, you can either MERGE or INSERT into your final destination.
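For the final load, a MERGE from the deduped staging table might look like the following sketch. The dbo.contacts and dbo.staging_contacts names and the choice of email_address as the matching key are assumptions for illustration:

MERGE INTO dbo.contacts AS target
USING dbo.staging_contacts AS source
    ON target.email_address = source.email_address       -- hypothetical business key
WHEN MATCHED THEN
    UPDATE SET target.entry_date = source.entry_date     -- refresh existing rows
WHEN NOT MATCHED BY TARGET THEN
    INSERT (email_address, entry_date)
    VALUES (source.email_address, source.entry_date);    -- add brand-new rows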
I also found a page that might be worth looking at, although with 17 million records it might take a bit too long.