Remove duplicate in SSIS package with preference over a column data

Posted by 一笑奈何 on 2020-02-08 03:17:35

Question


I have duplicate rows in data coming from an Excel sheet. In the SSIS package, I am using a Sort transformation that sorts in ascending order by the primary key column ID. Before removing the duplicates, though, I want to check whether the email column contains an address in my company's domain. If so, I want to remove the other rows and keep the one with that kind of email address. What should I do? Please refer to the image attached below.

In the data above, I want to remove the two rows for John where the email address is john@gmail.com. In Maria's case, I want to remove the two rows with the email address maria@gmail.com, thereby preserving the rows with email addresses in the mycompany.com domain. If a user has multiple rows with email addresses in the mycompany.com domain, I want to keep any one of those rows.

Any suggestions, please?


Answer 1:


You can do that in SQL as Kobi showed, and that may be easier. But if you prefer to do it in SSIS:

My test data:

Some points:

Conditional split: first, separate the rows with a mycompany email from those without.

Sort and non_mycompany sort: sort both outputs on id and remove duplicates.

mycompany_multicast: create two copies of the rows with mycompany.

Merge join: left join the rows without mycompany to the rows with mycompany. Note the join order; the purpose is to get the rows without mycompany that have no matching id in the rows with mycompany.

Conditional split1: keep the rows without mycompany that have no matching id in the rows with mycompany. You can check the id coming from the mycompany side: if that id is NULL, the row has no match among the rows with mycompany.

Union all: union the final result.
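The steps above can be checked outside SSIS with a small simulation. The Python below is only a sketch of the data flow logic, not SSIS code: the function names, the `DOMAIN` constant, and the case-insensitive domain check are assumptions made for illustration, and the sample rows copy the test data given in the SQL answer.

```python
# Sample rows as (id, name, email); copied from the SQL answer's test data.
ROWS = [
    (100, "john", "john@mycompany.com"),
    (100, "john", "john@gmail.com"),
    (100, "john", "john@gmail.com"),
    (200, "maria", "maria@mycompany.com"),
    (200, "maria", "maria@gmail.com"),
    (200, "maria", "maria@gmail.com"),
    (300, "jean", "jean@mycompany.com"),
    (300, "jean", "jean@gmail.com"),
    (300, "jean", "jean@mycompany.com"),
    (300, "jean", "jean@mycompany.com"),
    (400, "tom", "tom@gmail.com"),
    (400, "tom", "tom@gmail.com"),
]

DOMAIN = "@mycompany.com"  # assumed company domain


def dedupe_by_id(rows):
    """Mimic a Sort on id that also removes rows with duplicate ids."""
    seen = set()
    unique = []
    for row in sorted(rows, key=lambda r: r[0]):  # stable sort on id
        if row[0] not in seen:
            seen.add(row[0])
            unique.append(row)
    return unique


def ssis_style_dedupe(rows):
    # Conditional split: rows with the company domain vs. all other rows.
    company = [r for r in rows if r[2].lower().endswith(DOMAIN)]
    other = [r for r in rows if not r[2].lower().endswith(DOMAIN)]

    # Sort and non_mycompany sort: one row per id on each branch.
    company = dedupe_by_id(company)
    other = dedupe_by_id(other)

    # Merge join (left join) + Conditional split1: keep only the
    # non-company rows whose id has no match on the company side.
    company_ids = {r[0] for r in company}
    unmatched = [r for r in other if r[0] not in company_ids]

    # Union all: combine both branches into the final result.
    return sorted(company + unmatched, key=lambda r: r[0])


result = ssis_style_dedupe(ROWS)
for row in result:
    print(row)
```

Each id ends up with exactly one row, and the company address wins whenever one exists for that id.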




Answer 2:


You can use a statement like this:

WITH T AS
(
    SELECT ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY id, CASE WHEN email LIKE '%@mycompany.com' THEN 0 ELSE 1 END
           ) AS rn
    FROM persons
)
DELETE FROM T
WHERE rn > 1;

It groups the rows by id and, within each group, sorts them so that the preferred email (the one ending in @mycompany.com) comes first. It then assigns a row number within each group and finally deletes every row whose row number is greater than 1 (these are the duplicates).
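To see which rows survive without touching a database, the same ranking idea can be sketched in plain Python. This is an illustration under assumptions, not a replacement for the statement: `priority` and `keep_preferred` are hypothetical helper names, the domain check is made case-insensitive here, and the rows are the test data listed in this answer.

```python
from itertools import groupby

# Rows as (id, name, email); same values as the test data in this answer.
PERSONS = [
    (100, "john", "john@mycompany.com"),
    (100, "john", "john@gmail.com"),
    (100, "john", "john@gmail.com"),
    (200, "maria", "maria@mycompany.com"),
    (200, "maria", "maria@gmail.com"),
    (200, "maria", "maria@gmail.com"),
    (300, "jean", "jean@mycompany.com"),
    (300, "jean", "jean@gmail.com"),
    (300, "jean", "jean@mycompany.com"),
    (300, "jean", "jean@mycompany.com"),
    (400, "tom", "tom@gmail.com"),
    (400, "tom", "tom@gmail.com"),
]


def priority(email):
    """Plays the role of the CASE expression: 0 for the company domain, else 1."""
    return 0 if email.lower().endswith("@mycompany.com") else 1


def keep_preferred(rows):
    """Keep the row that would get rn = 1 in each id partition."""
    # Order by id, then by priority, like the ORDER BY inside OVER (...).
    ordered = sorted(rows, key=lambda r: (r[0], priority(r[2])))
    kept = []
    for _id, group in groupby(ordered, key=lambda r: r[0]):
        kept.append(next(group))  # first row of the partition (rn = 1)
    return kept


survivors = keep_preferred(PERSONS)
for row in survivors:
    print(row)
```

Ids 100, 200 and 300 keep their mycompany.com row, while id 400 has no such address and keeps one of its gmail.com rows.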

Here is the data to test:

CREATE TABLE Persons ( id NUMERIC(5), NAME VARCHAR(200), email VARCHAR(400) );

INSERT INTO persons VALUES ( 100, 'john', 'john@mycompany.com'), ( 100, 'john', 'john@gmail.com'), ( 100, 'john', 'john@gmail.com');

INSERT INTO persons VALUES ( 200, 'maria', 'maria@mycompany.com'), ( 200, 'maria', 'maria@gmail.com'), ( 200, 'maria', 'maria@gmail.com');

INSERT INTO persons VALUES ( 300, 'jean', 'jean@mycompany.com'), ( 300, 'jean', 'jean@gmail.com'), ( 300, 'jean', 'jean@mycompany.com'), ( 300, 'jean', 'jean@mycompany.com');

INSERT INTO persons VALUES ( 400, 'tom', 'tom@gmail.com'), ( 400, 'tom', 'tom@gmail.com');



Source: https://stackoverflow.com/questions/39013943/remove-duplicate-in-ssis-package-with-preference-over-a-column-data
