Let me first say that being able to take 17 million records from a flat file, pushing to a DB on a remote box and having it take 7 minutes is amazing. SSIS truly is fantasti
A bit of a dirty solution is to set your target table up with a composite key that spans all columns. This will guarantee distint uniqueness. Then on the Data Destination shape, configure the task to ignore errors. All duplicate inserts will fall off into oblivion.
The strategy will usually depend on how many columns the staging table has. The more columns, the more complex the solution. The article you linked has some very good advice.
The only thing that I will add to what everybody else has said so far, is that columns with date and datetime values will give some of the solutions presented here fits.
One solution that I came up with is this:
SET NOCOUNT ON
DECLARE @email varchar(100)
SET @email = ''
SET @emailid = (SELECT min(email) from StagingTable WITH (NOLOCK) WHERE email > @email)
WHILE @emailid IS NOT NULL
BEGIN
-- Do INSERT statement based on the email
INSERT StagingTable2 (Email)
FROM StagingTable WITH (NOLOCK)
WHERE email = @email
SET @emailid = (SELECT min(email) from StagingTable WITH (NOLOCK) WHERE email > @email)
END
This is a LOT faster when doing deduping, than a CURSOR and will not peg the server's CPU. To use this, separate each column that comes from the text file into their own variables. Use a separate SELECT statement before and inside the loop, then include them in the INSERT statement. This has worked really well for me.
Flat File Source --> Aggregate (Group By Columns you want unique) --> Flat File Destination
I would suggest using SSIS to copy the records to a temporary table, then create a task that uses Select Distinct or Rank depending on your situation to select the duplicates which would funnel them to a flat file and delete them from the temporary table. The last step would be to copy the records from the temporary table into the destination table.
Determining a duplicate is something SQL is good at but a flat file is not as well suited for. In the case you proposed, the script container would load a row and then would have to compare it against 17 million records, then load the next row and repeat...The performance might not be all that great.
Use the Sort Component.
Simply choose which fields you wish to sort your loaded rows by and in the bottom left corner you'll see a check box to remove duplicates. This box removes any rows which are duplicates based on the sort criteria only so in the example below the rows would be considered duplicate if we only sorted on the first field:
1 | sample A |
1 | sample B |
To do this on the flat file, I use the unix command line tool, sort:
sort -u inputfile > outputfile
Unfortunately, the windows sort command does not have a unique option, but you could try downloading a sort utility from one of these:
(I haven't tried them, so no guarantees, I'm afraid).
On the other hand, to do this as the records are loaded into the database, you could create a unique index on the key the database table whith ignore_dup_key. This will make the records unique very efficiently at load time.
CREATE UNIQUE INDEX idx1 ON TABLE (col1, col2, ...) WITH IGNORE_DUP_KEY