Question
I have 2 tables ...
- Customer
- CustomerIdentification
Customer table has 2 fields
- CustomerId varchar(20)
- Customer_Id_Link varchar(50)
CustomerIdentification table has 3 fields
- CustomerId varchar(20)
- Identification_Number varchar(50)
- Personal_ID_Type_Code int -- is a foreign key to another table, but that's irrelevant
Basically, Customer is the customer master table (with CustomerId as primary key) and CustomerIdentification can hold several pieces of identification for a given customer. In other words, CustomerId in CustomerIdentification is a foreign key to the Customer table. A customer can have many pieces of identification, each having an Identification_Number and a Personal_ID_Type_Code (which is an integer that tells you whether the identification is a passport, SIN, driver's license, etc.).
Now, the Customer table has the following data (Customer_Id_Link is blank, i.e. an empty string, at this point):
CustomerId    Customer_Id_Link
--------------------------------
'CU-1'        <Blank>
'CU-2'        <Blank>
'CU-3'        <Blank>
'CU-4'        <Blank>
'CU-5'        <Blank>
and CustomerIdentification table has the following data:
CustomerId    Identification_Number    Personal_ID_Type_Code
------------------------------------------------------------
'CU-1'        'A'                      1
'CU-1'        'A'                      2
'CU-1'        'A'                      3
'CU-2'        'A'                      1
'CU-2'        'B'                      3
'CU-2'        'C'                      4
'CU-3'        'A'                      1
'CU-3'        'B'                      2
'CU-3'        'C'                      4
'CU-4'        'A'                      1
'CU-4'        'B'                      2
'CU-4'        'B'                      3
'CU-5'        'B'                      3
Essentially, more than one customer can have the same Identification_Number and Personal_ID_Type_Code in CustomerIdentification. When this happens, all Customer_Id_Link fields need to be updated with a common value (could be a GUID or whatever). But the processing for this is more complex.
Rules are these:
For matching Personal_ID_Type_Code and Identification_Number fields between Customer records:
- Compare the Identification_Number fields for all other common Personal_ID_Type_Code fields for all the Customer records from the above match
- If they all agree, then link the Customer records
For example:
Match ID 1 A for CU-1, CU-2, CU-3, CU-4
- Exception ID 2 mismatch (A on CU-1 vs B on CU-3)
- No linkage done
Match ID 2 B for CU-3, CU-4
- No ID mismatch
- Link CU-3 and CU-4 (update Customer_Id_Link field with a common value in customer table for both)
Match ID 3 A for CU-1, CU-4
- Exception ID 2 mismatch (A vs B)
- No linkage done
Match ID 3 B for CU-2, CU-5
- No ID mismatch
- Link CU-2 and CU-5 (update Customer_Id_Link field with a common value in customer table for both)
Match ID 4 C for CU-2, CU-3
- CU-2 already linked, keep CU-5 to customer linking list
- CU-3 already linked, keep CU-4 to customer linking list
- Exception ID 3 mismatch (B on CU-2 vs A on CU-4)
- No linkage done (previous linkage remains)
Any help will be appreciated. This has kept me awake for two days now, and I can't seem to find the solution. Ideally, the solution will be a stored procedure that I can execute to do the customer linking.
- SQL Server 2008 R2 Standard 64 bit
UPDATE-------------------------------
I knew it was going to be tough to explain this problem, so I take the blame. Essentially, I want to link all the customers that have the same identification numbers, except that a customer can have more than one identification number. Take example 1: 1 A (1 being the Personal_ID_Type_Code and A being the Identification_Number) exists for 4 different customers: CU-1, CU-2, CU-3, CU-4. So they could potentially be the same customer existing 4 times in the customer table with different customer IDs, and we would need to link them with one common value. However, CU-1 has 2 other identifications, and if even one of them differs from those of the other 3 (CU-2, CU-3, CU-4), they are not the same customer. ID 2 with number A on CU-1 does not match ID 2 on CU-3 (it's B), and the same goes for CU-4. Also, even though CU-2 has no ID 2, CU-1's ID 3 with number A does not match CU-2's ID 3 (it's B). Therefore it's not a match at all.
The next common ID and number is 2 B, which exists on CU-3 and CU-4. These two customers are in fact the same, because both have ID 1 A and ID 2 B. ID 4 C and ID 3 A are irrelevant because they are different ID types. This essentially means this customer has 4 IDs: 1 A, 2 B, 4 C and 3 A. So now we need to link this customer with a common unique value (a GUID) in the customer table.
I hope I explained this very complicated issue now. It is tough to explain as this is a very unique problem.
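One way to read the rule in the update is as a test on a pair of customers: they are candidates for linking when they share at least one identification type, and every type they share carries the same number. The Python sketch below is only an illustration of that reading, not the stored procedure asked for. The `IDENTIFICATIONS` dictionary and the `compatible` function are hypothetical names, the data is copied from the CustomerIdentification table as posted, and the sketch assumes a customer has at most one number per type. It checks a single pair at a time and does not model the group-level exceptions in the worked example.

```python
from typing import Dict

# Hypothetical in-memory copy of the CustomerIdentification rows as posted:
# CustomerId -> {Personal_ID_Type_Code: Identification_Number}
# Assumption: at most one number per type for each customer.
IDENTIFICATIONS: Dict[str, Dict[int, str]] = {
    "CU-1": {1: "A", 2: "A", 3: "A"},
    "CU-2": {1: "A", 3: "B", 4: "C"},
    "CU-3": {1: "A", 2: "B", 4: "C"},
    "CU-4": {1: "A", 2: "B", 3: "B"},
    "CU-5": {3: "B"},
}


def compatible(a: Dict[int, str], b: Dict[int, str]) -> bool:
    """Return True when two customers share at least one ID type
    and agree on the number for every ID type they share."""
    common_types = a.keys() & b.keys()
    # No shared type means there is nothing to match on.
    if not common_types:
        return False
    # A single differing number on a shared type rules the pair out.
    return all(a[t] == b[t] for t in common_types)


if __name__ == "__main__":
    for left, right in [("CU-3", "CU-4"), ("CU-2", "CU-5"), ("CU-1", "CU-3")]:
        result = compatible(IDENTIFICATIONS[left], IDENTIFICATIONS[right])
        print(left, right, result)
```

With the table as posted, CU-3/CU-4 and CU-2/CU-5 pass this test, while CU-1 fails against CU-3 because of the differing number on type 2.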
Answer 1:
I've changed your data model a bit to try to make it more obvious what's going on.
CREATE TABLE [dbo].[Customer]
(
    [CustomerName] VARCHAR(20) NOT NULL,
    [CustomerLink] VARBINARY(20) NULL
)
CREATE TABLE [dbo].[CustomerIdentification]
(
    [CustomerName] VARCHAR(20) NOT NULL,
    [ID] VARCHAR(50) NOT NULL,
    [IDType] VARCHAR(16) NOT NULL
)
And I've added some more test data:
INSERT [dbo].[Customer]
    ([CustomerName])
VALUES ('Fred'),
       ('Bob'),
       ('Vince'),
       ('Tom'),
       ('Alice'),
       ('Matt'),
       ('Dan')
INSERT [dbo].[CustomerIdentification]
VALUES
    ('Fred', 'A', 'Passport'),
    ('Fred', 'A', 'SIN'),
    ('Fred', 'A', 'Drivers Licence'),
    ('Bob', 'A', 'Passport'),
    ('Bob', 'B', 'Drivers Licence'),
    ('Bob', 'C', 'Credit Card'),
    ('Vince', 'A', 'Passport'),
    ('Vince', 'B', 'SIN'),
    ('Vince', 'C', 'Credit Card'),
    ('Tom', 'A', 'Passport'),
    ('Tom', 'B', 'SIN'),
    ('Tom', 'B', 'Drivers Licence'),
    ('Alice', 'B', 'Drivers Licence'),
    ('Matt', 'X', 'Drivers Licence'),
    ('Dan', 'X', 'Drivers Licence')
Is this what you're looking for?
;WITH [cteNonMatchingIDs] AS (
    -- Pairs where the IDType is the same, but
    -- name and ID don't match
    SELECT ci3.[CustomerName] AS [CustomerName1],
           ci4.[CustomerName] AS [CustomerName2]
    FROM [dbo].[CustomerIdentification] ci3
    INNER JOIN [dbo].[CustomerIdentification] ci4
        ON ci3.[IDType] = ci4.[IDType]
    WHERE ci3.[CustomerName] <> ci4.[CustomerName]
      AND ci3.[ID] <> ci4.[ID]
),
[cteMatchedPairs] AS (
    -- Pairs where the IDType and ID match, and
    -- there aren't any non matching IDs for the
    -- CustomerName
    SELECT DISTINCT
           ci1.[CustomerName] AS [CustomerName1],
           ci2.[CustomerName] AS [CustomerName2]
    FROM [dbo].[CustomerIdentification] ci1
    LEFT JOIN [dbo].[CustomerIdentification] ci2
        ON ci1.[CustomerName] <> ci2.[CustomerName]
       AND ci1.[IDType] = ci2.[IDType]
    WHERE ci1.[ID] = ISNULL(ci2.[ID], ci1.[ID])
      AND NOT EXISTS (
            SELECT 1
            FROM [cteNonMatchingIDs]
            WHERE ci1.[CustomerName] = [CustomerName1] -- correlated subquery
              AND ci2.[CustomerName] = [CustomerName2]
          )
      AND ci1.[CustomerName] < ci2.[CustomerName]
),
[cteMatchedList] ([CustomerName], [CustomerNameList]) AS (
    -- Turn the matched pairs into list of matching
    -- CustomerNames
    SELECT [CustomerName1],
           [CustomerNameList]
    FROM (
        SELECT [CustomerName1],
               CONVERT(VARCHAR(1000), '$'
                   + [CustomerName1] + '$'
                   + [CustomerName2]) AS [CustomerNameList]
        FROM [cteMatchedPairs]
        UNION ALL
        SELECT [CustomerName2],
               CONVERT(VARCHAR(1000), '$'
                   + [CustomerName2]) AS [CustomerNameList]
        FROM [cteMatchedPairs]
    ) [cteMatchedPairs]
    UNION ALL
    SELECT [cteMatchedList].[CustomerName],
           CONVERT(VARCHAR(1000), [CustomerNameList] + '$'
               + [cteMatchedPairs].[CustomerName2])
    FROM [cteMatchedList] -- recursive CTE
    INNER JOIN [cteMatchedPairs]
        ON RIGHT([cteMatchedList].[CustomerNameList],
                 LEN([cteMatchedPairs].[CustomerName1])
           ) = [cteMatchedPairs].[CustomerName1]
),
[cteSubstringLists] AS (
    SELECT r1.[CustomerName],
           r2.[CustomerNameList]
    FROM [cteMatchedList] r1
    INNER JOIN [cteMatchedList] r2
        ON r2.[CustomerNameList] LIKE '%' + r1.[CustomerNameList] + '%'
),
[cteCustomerLink] AS (
    SELECT DISTINCT
           x1.[CustomerName],
           HASHBYTES('SHA1', x2.[CustomerNameList]) AS [CustomerLink]
    FROM (
        SELECT [CustomerName],
               MAX(LEN([CustomerNameList])) AS [MAX LEN CustomerList]
        FROM [cteSubstringLists]
        GROUP BY [CustomerName]
    ) x1
    INNER JOIN (
        SELECT [CustomerName],
               LEN([CustomerNameList]) AS [LEN CustomerList],
               [CustomerNameList]
        FROM [cteSubstringLists]
    ) x2
        ON x1.[MAX LEN CustomerList] = x2.[LEN CustomerList]
       AND x1.[CustomerName] = x2.[CustomerName]
)
UPDATE c
SET [CustomerLink] = cl.[CustomerLink]
FROM [dbo].[Customer] c
INNER JOIN [cteCustomerLink] cl
    ON cl.[CustomerName] = c.[CustomerName]
SELECT *
FROM [dbo].[Customer]
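The recursive CTE above builds its groups by concatenating names into '$'-separated strings. The same grouping step, turning matched pairs into sets of linked customers, can also be viewed as finding connected components. The Python sketch below shows that idea with a small union-find; it is an alternative illustration, not a translation of the query. `link_groups` and the sample pairs are hypothetical, and the alphabetically smallest name in each group stands in for the common link value (a GUID or a hash would do equally well). Note that chaining pairs this way links A with C whenever A-B and B-C match, even if A and C were never compared directly.

```python
from typing import Dict, Iterable, Tuple


def link_groups(pairs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map every customer that appears in a matched pair to a group label.

    The label is the alphabetically smallest name in the group, which
    keeps the result deterministic for this illustration.
    """
    parent: Dict[str, str] = {}

    def find(name: str) -> str:
        # Register unseen names as their own root, then walk up to the
        # root, shortening the path along the way.
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            # Keep the smaller name as the root of the merged group.
            if root_right < root_left:
                root_left, root_right = root_right, root_left
            parent[root_right] = root_left

    return {name: find(name) for name in parent}


if __name__ == "__main__":
    # Illustrative pairs only; not the output of the query above.
    sample_pairs = [("Tom", "Vince"), ("Dan", "Matt")]
    for customer, label in sorted(link_groups(sample_pairs).items()):
        print(customer, "->", label)
```

Each label could then be written to the link column for every customer in its group; customers that never appear in a pair are simply absent from the result.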
Source: https://stackoverflow.com/questions/8014815/a-very-complicated-sql-query-issue