How to remove duplicate rows from a flat file using SSIS?

轻奢々 2021-01-12 21:42

Let me first say that being able to take 17 million records from a flat file, push them to a DB on a remote box, and have it take 7 minutes is amazing. SSIS truly is fantastic.

9 Answers
  • 2021-01-12 22:09

    We can use lookup transformations for this. SSIS provides two data flow transformations for it: Fuzzy Grouping and Fuzzy Lookup.
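
    Fuzzy Grouping and Fuzzy Lookup are configured in the SSIS designer rather than written as code, but as a rough T-SQL analogy (my own sketch, not how the components work internally, using a hypothetical stg_contacts staging table), approximate matching of near-duplicate values can be illustrated with SOUNDEX/DIFFERENCE:

        -- Illustration only: find pairs of rows whose last names sound alike.
        -- DIFFERENCE returns 0-4; 4 means the two SOUNDEX codes match exactly.
        SELECT  a.row_id, a.last_name, b.row_id, b.last_name
        FROM    stg_contacts AS a
        JOIN    stg_contacts AS b
            ON  a.row_id < b.row_id                       -- each pair once, no self-pairs
            AND DIFFERENCE(a.last_name, b.last_name) = 4; -- likely near-duplicates

    The real components score similarity across several columns at once and let you choose a threshold, so treat this only as a flavor of what "fuzzy" means here.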

  • 2021-01-12 22:10

    I would recommend loading a staging table on the destination server and then merging the results into a target table there. If you need to run any hygiene rules, a stored procedure is the place to do it, since you are bound to get better performance than with SSIS data flow transformation tasks. Besides, deduping is generally a multi-step process. You may want to dedupe on:

    1. Distinct lines.
    2. Distinct groups of columns, such as first name, last name, and email address.
    3. Matches against an existing target table. If that's the case, you may need NOT EXISTS or NOT IN filters, or you may want to update the original row with new values; this is usually best served with a MERGE statement and a subquery for the source. (Items 1-3 are sketched in set-based form after the sample below.)
    4. The first or last row of a particular pattern. For instance, you may want the last row entered in the file for each occurrence of an email address or phone number. I usually rely on CTEs with ROW_NUMBER() to generate sequential-order and reverse-order columns, as in the following sample:


    -- Sample data: three entries for one email address, two for another.
    WITH sample_records (email_address, entry_date, row_identifier) AS
    (
        SELECT 'tester@test.com',         '2009-10-08 10:00:00', 1 UNION ALL
        SELECT 'tester@test.com',         '2009-10-08 10:00:01', 2 UNION ALL
        SELECT 'tester@test.com',         '2009-10-08 10:00:02', 3 UNION ALL
        SELECT 'the_other_test@test.com', '2009-10-08 10:00:00', 4 UNION ALL
        SELECT 'the_other_test@test.com', '2009-10-08 10:00:00', 5
    )
    ,   filter_records AS
    (
        SELECT  email_address
            ,   entry_date
            ,   row_identifier
                -- Number the rows for each email address in entry order...
            ,   ROW_NUMBER() OVER (PARTITION BY email_address
                                   ORDER BY     row_identifier ASC)  AS sequential_order
                -- ...and in reverse order, so reverse_order = 1 marks the last entry.
            ,   ROW_NUMBER() OVER (PARTITION BY email_address
                                   ORDER BY     row_identifier DESC) AS reverse_order
        FROM    sample_records
    )
    SELECT      email_address
            ,   entry_date
            ,   row_identifier
    FROM        filter_records
    WHERE       reverse_order = 1   -- keep only the last row per email address
    ORDER BY    email_address;
    
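    Items 1-3 above can be sketched in set-based form as well. This is only an outline, assuming a hypothetical staging table stg_contacts and target table dbo.contacts; the names and columns are mine, not from the question:

        -- Item 1: collapse fully identical lines into a temp table.
        SELECT DISTINCT first_name, last_name, email_address, entry_date
        INTO   #deduped
        FROM   stg_contacts;

        -- Item 2: dedupe on a group of columns, keeping the earliest entry per group.
        SELECT   first_name, last_name, email_address, MIN(entry_date) AS entry_date
        FROM     stg_contacts
        GROUP BY first_name, last_name, email_address;

        -- Item 3: insert only rows that do not already exist in the target table.
        INSERT INTO dbo.contacts (first_name, last_name, email_address)
        SELECT  s.first_name, s.last_name, s.email_address
        FROM    #deduped AS s
        WHERE   NOT EXISTS (SELECT 1
                            FROM   dbo.contacts AS c
                            WHERE  c.email_address = s.email_address);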

    There are lots of options for deduping files, but ultimately I recommend handling this in a stored procedure once you have loaded a staging table on the destination server. After you cleanse the data, you can either MERGE or INSERT into your final destination.
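
    A minimal MERGE sketch for that final step, again with hypothetical names (stg_contacts, dbo.contacts):

        MERGE dbo.contacts AS target
        USING (
                -- The source subquery must already be deduplicated: MERGE fails if
                -- more than one source row matches the same target row. Reuse the
                -- reverse_order trick from above to keep the last entry per address.
                SELECT  email_address, entry_date
                FROM   (SELECT email_address, entry_date,
                               ROW_NUMBER() OVER (PARTITION BY email_address
                                                  ORDER BY     entry_date DESC) AS reverse_order
                        FROM   stg_contacts) AS d
                WHERE   reverse_order = 1
              ) AS source
            ON  target.email_address = source.email_address
        WHEN MATCHED THEN
            UPDATE SET target.entry_date = source.entry_date
        WHEN NOT MATCHED THEN
            INSERT (email_address, entry_date)
            VALUES (source.email_address, source.entry_date);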

  • 2021-01-12 22:16

    Found this page (link text) that might be worth looking at, although with 17 million records it might take a bit too long.
