How to remove duplicate rows from a flat file using SSIS?

轻奢々 2021-01-12 21:42

Let me first say that being able to take 17 million records from a flat file, push them to a DB on a remote box, and have it take 7 minutes is amazing. SSIS truly is fantastic.

9 Answers
  • 2021-01-12 22:09

    We can use lookup transformations for this. SSIS provides two data flow transformations for it: Fuzzy Grouping and Fuzzy Lookup.
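
    Fuzzy Grouping and Fuzzy Lookup are configured in the SSIS designer rather than written as code, but as a rough T-SQL analogy (my own sketch, not how the components work internally, using a hypothetical stg_contacts staging table), approximate matching of near-duplicate values can be illustrated with SOUNDEX/DIFFERENCE:

        -- Illustration only: find pairs of rows whose last names sound alike.
        -- DIFFERENCE returns 0-4; 4 means the two SOUNDEX codes match exactly.
        SELECT  a.row_id, a.last_name, b.row_id, b.last_name
        FROM    stg_contacts AS a
        JOIN    stg_contacts AS b
            ON  a.row_id < b.row_id                       -- each pair once, no self-pairs
            AND DIFFERENCE(a.last_name, b.last_name) = 4; -- likely near-duplicates

    The real components score similarity across several columns at once and let you choose a threshold, so treat this only as a flavor of what "fuzzy" means here.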

  • 2021-01-12 22:10

    I would recommend loading a staging table on the destination server and then merging the results into a target table there. If you need to run any hygiene rules, a stored procedure is the place to do it, since you are bound to get better performance than with SSIS data flow transformation tasks. Besides, deduping is generally a multi-step process. You may want to dedupe on:

    1. Distinct lines.
    2. Distinct groups of columns, such as first name, last name, and email address.
    3. Matches against an existing target table. If that's the case, you may need NOT EXISTS or NOT IN filters, or you may want to update the original row with new values; this is usually best served with a MERGE statement and a subquery for the source. (Items 1-3 are sketched in set-based form after the sample below.)
    4. The first or last row of a particular pattern. For instance, you may want the last row entered in the file for each occurrence of an email address or phone number. I usually rely on CTEs with ROW_NUMBER() to generate sequential-order and reverse-order columns, as in the following sample:


    -- Sample data: three entries for one email address, two for another.
    WITH sample_records (email_address, entry_date, row_identifier) AS
    (
        SELECT 'tester@test.com',         '2009-10-08 10:00:00', 1 UNION ALL
        SELECT 'tester@test.com',         '2009-10-08 10:00:01', 2 UNION ALL
        SELECT 'tester@test.com',         '2009-10-08 10:00:02', 3 UNION ALL
        SELECT 'the_other_test@test.com', '2009-10-08 10:00:00', 4 UNION ALL
        SELECT 'the_other_test@test.com', '2009-10-08 10:00:00', 5
    )
    ,   filter_records AS
    (
        SELECT  email_address
            ,   entry_date
            ,   row_identifier
                -- Number the rows for each email address in entry order...
            ,   ROW_NUMBER() OVER (PARTITION BY email_address
                                   ORDER BY     row_identifier ASC)  AS sequential_order
                -- ...and in reverse order, so reverse_order = 1 marks the last entry.
            ,   ROW_NUMBER() OVER (PARTITION BY email_address
                                   ORDER BY     row_identifier DESC) AS reverse_order
        FROM    sample_records
    )
    SELECT      email_address
            ,   entry_date
            ,   row_identifier
    FROM        filter_records
    WHERE       reverse_order = 1   -- keep only the last row per email address
    ORDER BY    email_address;
    
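    Items 1-3 above can be sketched in set-based form as well. This is only an outline, assuming a hypothetical staging table stg_contacts and target table dbo.contacts; the names and columns are mine, not from the question:

        -- Item 1: collapse fully identical lines into a temp table.
        SELECT DISTINCT first_name, last_name, email_address, entry_date
        INTO   #deduped
        FROM   stg_contacts;

        -- Item 2: dedupe on a group of columns, keeping the earliest entry per group.
        SELECT   first_name, last_name, email_address, MIN(entry_date) AS entry_date
        FROM     stg_contacts
        GROUP BY first_name, last_name, email_address;

        -- Item 3: insert only rows that do not already exist in the target table.
        INSERT INTO dbo.contacts (first_name, last_name, email_address)
        SELECT  s.first_name, s.last_name, s.email_address
        FROM    #deduped AS s
        WHERE   NOT EXISTS (SELECT 1
                            FROM   dbo.contacts AS c
                            WHERE  c.email_address = s.email_address);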

    There are lots of options for deduping files, but ultimately I recommend handling this in a stored procedure once you have loaded a staging table on the destination server. After you cleanse the data, you can either MERGE or INSERT into your final destination.
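
    A minimal MERGE sketch for that final step, again with hypothetical names (stg_contacts, dbo.contacts):

        MERGE dbo.contacts AS target
        USING (
                -- The source subquery must already be deduplicated: MERGE fails if
                -- more than one source row matches the same target row. Reuse the
                -- reverse_order trick from above to keep the last entry per address.
                SELECT  email_address, entry_date
                FROM   (SELECT email_address, entry_date,
                               ROW_NUMBER() OVER (PARTITION BY email_address
                                                  ORDER BY     entry_date DESC) AS reverse_order
                        FROM   stg_contacts) AS d
                WHERE   reverse_order = 1
              ) AS source
            ON  target.email_address = source.email_address
        WHEN MATCHED THEN
            UPDATE SET target.entry_date = source.entry_date
        WHEN NOT MATCHED THEN
            INSERT (email_address, entry_date)
            VALUES (source.email_address, source.entry_date);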

  • 2021-01-12 22:16

    Found this page (link text) that might be worth looking at, although with 17 million records it might take a bit too long.
