How to prune duplicate associations to yield a unique most-complete set

Submitted by 。_饼干妹妹 on 2020-01-03 01:51:26

Question


I hardly know how to state this question, let alone search for answers. But here's my best shot. Assume I have a table

Col1   Col2
-----+-----
 A   | 1
 A   | 2
 A   | 3
 A   | 4
 B   | 1
 B   | 2
 B   | 3
 C   | 1
 C   | 2
 C   | 3
 D   | 1

I want to find the subset of associations (rows) where:

  1. There are no duplicates in Col1
  2. There are no duplicates in Col2
  3. Every value in Col1 is associated with a value in Col2

So the above example could yield this result

Col1   Col2
-----+-----
 A   | 4
 B   | 2
 C   | 3
 D   | 1

Notice that A-4 must be in the result because there are 4 unique letters and 4 unique numbers, and A is the only letter associated with 4. If you don't associate A with 4, no remaining subset maps every value in Col1 while retaining the uniqueness of Col2.

Also, notice that it would be equally valid to replace B-2 and C-3 with B-3 and C-2. I don't care which subset is selected, but I want one that fulfills all the requirements.

Not every set of data will have a sub-set that fulfills all the requirements, but I want to get as close as possible.
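To make the three requirements concrete, here is a minimal Python sketch of a checker for a candidate subset. The function name `check_subset` and the tuple representation of rows are assumptions made for illustration only; it also assumes the candidate rows are drawn from the original table.

```python
from collections import Counter

def check_subset(rows, subset):
    """Return (no_dup_col1, no_dup_col2, covers_all_col1) for a candidate subset."""
    col1_counts = Counter(c1 for c1, _ in subset)
    col2_counts = Counter(c2 for _, c2 in subset)
    # Requirements 1 and 2: each value appears at most once in its column.
    no_dup_col1 = all(n == 1 for n in col1_counts.values())
    no_dup_col2 = all(n == 1 for n in col2_counts.values())
    # Requirement 3: every Col1 value of the full table appears in the subset.
    covers_all_col1 = {c1 for c1, _ in rows} == set(col1_counts)
    return no_dup_col1, no_dup_col2, covers_all_col1

rows = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
        ("B", 1), ("B", 2), ("B", 3),
        ("C", 1), ("C", 2), ("C", 3),
        ("D", 1)]

print(check_subset(rows, [("A", 4), ("B", 2), ("C", 3), ("D", 1)]))  # (True, True, True)
print(check_subset(rows, [("A", 1), ("B", 2), ("C", 3), ("D", 1)]))  # (True, False, True)
```

The second call fails requirement 2 because both A and D are mapped to 1.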

I'm trying to do this with a SQL query. I had a query that seemed to accomplish this for one set of data, but then I had to rewrite it for a slightly different set (where Col2 is actually a pair of columns) and could not reproduce my earlier success. My first solution used Min() and Group By and a couple of joins on aggregated results to mark duplicates for elimination in a loop until there was nothing left to safely eliminate. My more recent solution replaces the Group By queries with ROW_NUMBER() expressions that use PARTITION BY.

But I can't figure out how to handle the cases where there are multiple valid result sets from multiply-cross-linked pairs like B and C in the above example. My earlier query might have handled it, but I can't quite comprehend what I did (must have had a good day when I wrote that one). Perhaps I need to do a JOIN on the ROW_NUMBER expressions in my sub-queries? My brain gave out for today. I hope someone can help me find an ingeniously simple solution.


Answer 1:


It seems to me that you're aiming for something that SQL is not strong enough for. This is a non-standard algorithmic task, and I think you need a real programming language to achieve it. Your task reminds me of chess riddles.




Answer 2:


The problem is equivalent to finding a maximum matching in a bipartite graph. Each column element represents a vertex, each row represents an edge. The linked Wikipedia article provides some pointers to algorithms for solving this problem. There is an implementation of the Hungarian algorithm in Google's or-tools library.

Here's the given example formulated as a graph, with the red edges representing the given solution:

[Figure: the example as a bipartite graph, solution edges in red]

It would be surprising to me if you could find a solution purely in SQL.
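As a rough illustration of the maximum-matching idea outside SQL, here is a minimal Python sketch using the classic augmenting-path approach. This is not the Hungarian algorithm mentioned above and not the or-tools API; the function name `maximum_matching` and the tuple representation of rows are assumptions made for this example.

```python
def maximum_matching(rows):
    """Return a largest set of (col1, col2) pairs with no repeated col1 or col2."""
    # Adjacency list: each Col1 value maps to the Col2 values it appears with.
    candidates = {}
    for c1, c2 in rows:
        candidates.setdefault(c1, []).append(c2)

    owner = {}  # Col2 value -> Col1 value currently matched to it

    def try_assign(c1, visited):
        """Try to match c1, re-routing earlier matches where necessary."""
        for c2 in candidates[c1]:
            if c2 in visited:
                continue
            visited.add(c2)
            # Take c2 if it is free, or if its current owner can move elsewhere.
            if c2 not in owner or try_assign(owner[c2], visited):
                owner[c2] = c1
                return True
        return False

    for c1 in candidates:
        try_assign(c1, set())

    return sorted((c1, c2) for c2, c1 in owner.items())

rows = [("A", 1), ("A", 2), ("A", 3), ("A", 4),
        ("B", 1), ("B", 2), ("B", 3),
        ("C", 1), ("C", 2), ("C", 3),
        ("D", 1)]

pairs = maximum_matching(rows)
print(pairs)
```

On the example table this yields four pairs, with A matched to 4 and D matched to 1 as the question requires. When no complete assignment exists, the result simply contains fewer pairs, which corresponds to getting "as close as possible".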




Answer 3:


Try this query. It isn't great for a huge dataset, but it does what you want. If there is a value in Col1 for which it cannot find a unique Col2 value, it puts 0 there; that value is hardcoded, so change it to whatever you like to indicate the absence of a unique value. I used a table named testing (col1, col2); put your own table name in place of testing.

This is a greedy algorithm that tries to maximize the chance of giving every value in Col1 its own value from Col2. The steps are as follows.

  1. Retrieve the Col1 values in ascending order of the number of Col2 values each is associated with.
  2. Start with the Col1 value that has the fewest Col2 values and assign it one (here D, which has only one associated value).
  3. Go to the next unassigned Col1 value and assign it any of its Col2 values that has not already been assigned (here B or C, since each has 3 values; 1 is already assigned to D, so 2 or 3).
  4. Repeat step 3 for every remaining value in the list from step 1.


The following code implements this algorithm; it is not an optimal implementation.

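DECLARE @COUNTER    INT = 1
DECLARE @MAX        INT = 0

-- One row per COL1 value; COL2 = 0 means no unique value has been assigned
DECLARE @TEMPTABLE TABLE
(
    ROWNUM  INT     IDENTITY(1,1)
    ,COL1   CHAR(1)
    ,COL2   INT
)

-- Number the COL1 values from fewest to most associated COL2 values
INSERT INTO @TEMPTABLE (COL1, COL2)
SELECT COL1, 0
FROM    testing
GROUP BY COL1
ORDER BY COUNT(COL2)

SELECT @MAX = MAX(ROWNUM) FROM @TEMPTABLE

WHILE ( @COUNTER <= @MAX )
BEGIN
        -- Assign the current COL1 value a COL2 value that is not yet taken.
        -- If several rows qualify, SQL Server picks one of them arbitrarily;
        -- if none qualifies, COL2 stays 0.
        UPDATE TT
        SET COL2 = T.COL2
        FROM testing T
        INNER JOIN @TEMPTABLE TT
        ON  T.COL1 = TT.COL1
        WHERE T.COL2 NOT IN (SELECT COL2 FROM @TEMPTABLE)
        AND TT.ROWNUM = @COUNTER

        SET @COUNTER = @COUNTER + 1
END

SELECT COL1, COL2 FROM @TEMPTABLE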



Answer 4:


This seems to do the trick (I will review the other answers and compare after posting):

CREATE TABLE Trial(Col1 nvarchar(5) not null, Col2 int not null, Eliminated bit not null)

INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('A', 1, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('A', 2, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('A', 3, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('A', 4, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('B', 1, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('B', 2, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('B', 3, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('C', 1, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('C', 2, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('C', 3, 0)
INSERT INTO Trial(Col1, Col2, Eliminated) VALUES('D', 1, 0)

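-- Pass 1: eliminate rows whose Col2 value is shared by more remaining rows
-- than their Col1 value, provided that Col1 value still has other rows
UPDATE T0 SET Eliminated = 1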
FROM Trial T0
JOIN (
   SELECT Col1, COUNT(*) Dups
   FROM Trial
   WHERE Eliminated = 0
   GROUP BY Col1) T1
   ON T0.Col1 = T1.Col1
JOIN (
   SELECT Col2, COUNT(*) Dups
   FROM Trial
   WHERE Eliminated = 0
   GROUP BY Col2) T2
   ON T2.Col2 = T0.Col2
WHERE T2.Dups > T1.Dups AND T1.Dups > 1

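-- Pass 2: the mirror case; eliminate rows whose Col1 value has more remaining
-- rows than their Col2 value, provided that Col2 value still has other rows
UPDATE T0 SET Eliminated = 1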
FROM Trial T0
JOIN (
   SELECT Col1, COUNT(*) Dups
   FROM Trial
   WHERE Eliminated = 0
   GROUP BY Col1) T1
   ON T0.Col1 = T1.Col1
JOIN (
   SELECT Col2, COUNT(*) Dups
   FROM Trial
   WHERE Eliminated = 0
   GROUP BY Col2) T2
   ON T2.Col2 = T0.Col2
WHERE T1.Dups > T2.Dups AND T2.Dups > 1

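-- Pass 3: among the remaining rows, keep only those whose position within
-- their Col1 group equals their position within their Col2 group
UPDATE T0 SET Eliminated = 1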
FROM Trial T0
JOIN (
   SELECT Col1, Col2, ROW_NUMBER() OVER (PARTITION BY Col1 ORDER BY Col2) Dup
   FROM Trial
   WHERE Eliminated = 0) T1 ON T1.Col1 = T0.Col1 AND T1.Col2 = T0.Col2
JOIN (
   SELECT Col1, Col2, ROW_NUMBER() OVER (PARTITION BY Col2 ORDER BY Col1) Dup
   FROM Trial
   WHERE Eliminated = 0) T2 ON T2.Col1 = T0.Col1 AND T2.Col2 = T0.Col2
WHERE T1.Dup <> T2.Dup

It may not be perfect, but seems to work on my data.



Source: https://stackoverflow.com/questions/5279693/how-to-prune-duplicate-associations-to-yield-a-unique-most-complete-set
