How to compare two columns to find unmatched records in MySQL

后端 未结 2 594
花落未央
花落未央 2021-02-04 17:40

I have a MySQL table with 2 columns and each column has thousands of records

For Example 15000 Email addresses in Column1 and 15005 Email addresses in column 2

H

2条回答
  •  有刺的猬
    2021-02-04 18:31

    Richard, it's highly unusual to find matching/missing rows from one column in a table compared against another column in the same table.

    You can think of a table as being a collection of facts, with each row being one fact. Converting values into predicates is how we understand the data. The value "12" in one table may mean "there exists a day on which 12 widgets were made," or "12 people bought widgets on Jan. 1," or "on Jan. 12, no widgets were sold," but whatever the table's corresponding predicate is, "12" should represent a fact.

    It's common to want to find the difference between two tables: "what facts are in B that aren't in A?" But in a table with two columns, each row should conceptually be a fact about that pair of values. Perhaps the predicate for the row (12, 13) might be "on Jan. 12, we sold 13 widgets." But in that case I doubt you'd be asking for this information.

    So, if (12,13) is really two of the same predicate -- "someone in district 12 bought widgets, and also, someone in district 13 bought widgets" -- in the long run life will be easier if those are one column, not two. And if it's two different predicates, it would make more sense for them to be in two tables. SQL's flexible and can handle these situations, but you may run into more problems later. If you're interested in more about this subject, searching on "normalization" will find you way more than you want to know :)

    Anyway, I think the query you're looking for uses a LEFT JOIN to compare the table against itself. I added the values 1-15000 to col1 and 1-15005 to col2 in this table:

    CREATE TABLE `foo` (
      `col1` int(11) DEFAULT NULL,
      `col2` int(11) DEFAULT NULL,
      KEY `idx_col1` (`col1`),
      KEY `idx_col2` (`col2`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
    
    mysql> select count(distinct col1), count(distinct col2) from foo;
    +----------------------+----------------------+
    | count(distinct col1) | count(distinct col2) |
    +----------------------+----------------------+
    |                15000 |                15005 |
    +----------------------+----------------------+
    1 row in set (0.01 sec)
    

    By giving the same table two names, I can compare its two columns against each other, and find the col2 values that have no corresponding col1 values -- in those cases, f1.col1 will be NULL:

    mysql> select f2.col2
    from foo as f2 left join foo as f1 on (f2.col2=f1.col1)
    where f1.col1 is null;
    +-------+
    | col2  |
    +-------+
    | 15001 |
    | 15002 |
    | 15003 |
    | 15004 |
    | 15005 |
    +-------+
    5 rows in set (0.03 sec)
    

    Regarding Mosty's solution yesterday, I'm not sure it's correct. I try not to use subqueries, so I'm a little out of my depth here. But it doesn't seem to work for at least my attempt to replicate your data set:

    mysql> select col2 from foo where col2 not in
    (select col1 from foo);
    Empty set (0.02 sec)
    

    It works if I exclude the 5 NULLs from the subquery, which suggests to me that "NOT IN (NULL)" doesn't necessarily work the way one might think it works:

    mysql> select col2 from foo where col2 not in
    (select col1 from foo where col1 is not null);
    +-------+
    | col2  |
    +-------+
    | 15001 |
    | 15002 |
    | 15003 |
    | 15004 |
    | 15005 |
    +-------+
    5 rows in set (0.02 sec)
    

    The main reason I avoid subqueries in MySQL is that they have unpredictable performance characteristics, or at least, complex enough that I can't predict them. For more information, see the "O(MxN)" comment in http://dev.mysql.com/doc/refman/5.5/en/subquery-restrictions.html and the advice on the short webpage http://dev.mysql.com/doc/refman/5.5/en/rewriting-subqueries.html .

提交回复
热议问题