NOT IN vs NOT EXISTS

粉色の甜心 2020-11-21 12:08

Which of these queries is faster?

NOT EXISTS:

SELECT ProductID, ProductName 
FROM Northwind..Products p
WHERE NOT EXISTS (
    SELECT 1 
    FROM         


        
11 answers
  • 2020-11-21 12:45

    I was using

    SELECT * from TABLE1 WHERE Col1 NOT IN (SELECT Col1 FROM TABLE2)
    

    and found that it returned wrong results (by wrong I mean no rows at all), because there was a NULL in TABLE2.Col1.

    Changing the query to

    SELECT * from TABLE1 T1 WHERE NOT EXISTS (SELECT Col1 FROM TABLE2 T2 WHERE T1.Col1 = T2.Col1)
    

    gave me the correct results.

    Since then I have started using NOT EXISTS everywhere.
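
    The behaviour described in this answer can be reproduced with a small sketch. It assumes SQLite (through Python's built-in sqlite3 module) as a stand-in for SQL Server, and the table contents are made up for illustration; only the three-valued NULL logic is being shown, which both engines share for NOT IN.

```python
import sqlite3

# Hypothetical data: TABLE2.Col1 contains a NULL, as in the answer above.
# SQLite is used here only as a convenient stand-in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TABLE1 (Col1 INTEGER);
    CREATE TABLE TABLE2 (Col1 INTEGER);
    INSERT INTO TABLE1 VALUES (1), (2), (3);
    INSERT INTO TABLE2 VALUES (1), (NULL);
""")

# NOT IN: "2 NOT IN (1, NULL)" evaluates to UNKNOWN, so no row qualifies.
not_in_rows = conn.execute(
    "SELECT Col1 FROM TABLE1 "
    "WHERE Col1 NOT IN (SELECT Col1 FROM TABLE2) ORDER BY Col1"
).fetchall()

# NOT EXISTS: the NULL row never satisfies T1.Col1 = T2.Col1, so it is ignored.
not_exists_rows = conn.execute(
    "SELECT Col1 FROM TABLE1 T1 "
    "WHERE NOT EXISTS (SELECT 1 FROM TABLE2 T2 WHERE T1.Col1 = T2.Col1) "
    "ORDER BY Col1"
).fetchall()

print(not_in_rows)      # []
print(not_exists_rows)  # [(2,), (3,)]
```

    Removing the NULL row from TABLE2 makes both queries return the same two rows.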

  • 2020-11-21 12:47

    If the optimizer says they are the same then consider the human factor. I prefer to see NOT EXISTS :)

  • 2020-11-21 12:48

    Actually, I believe this would be the fastest:

    SELECT p.ProductID, p.ProductName
    FROM Northwind..Products p
        LEFT OUTER JOIN Northwind..[Order Details] od ON p.ProductID = od.ProductID
    WHERE od.ProductID IS NULL
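
    Whether this form is actually the fastest is not something the sketch below measures. It only checks, on made-up rows and with SQLite standing in for SQL Server, that the LEFT OUTER JOIN ... IS NULL pattern returns the same rows as NOT EXISTS, including when [Order Details].ProductID contains a NULL.

```python
import sqlite3

# Hypothetical miniature of the Northwind tables; SQLite is a stand-in engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
    CREATE TABLE [Order Details] (OrderID INTEGER, ProductID INTEGER);
    INSERT INTO Products VALUES (1, 'Chai'), (2, 'Chang'), (3, 'Tofu');
    INSERT INTO [Order Details] VALUES (100, 1), (101, 1), (102, NULL);
""")

# Anti-join written as an outer join: keep only products with no matching order row.
left_join_rows = conn.execute("""
    SELECT p.ProductID, p.ProductName
    FROM Products p
    LEFT OUTER JOIN [Order Details] od ON p.ProductID = od.ProductID
    WHERE od.ProductID IS NULL
    ORDER BY p.ProductID
""").fetchall()

# The same anti-join written with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT p.ProductID, p.ProductName
    FROM Products p
    WHERE NOT EXISTS (SELECT 1 FROM [Order Details] od
                      WHERE p.ProductID = od.ProductID)
    ORDER BY p.ProductID
""").fetchall()

print(left_join_rows)   # [(2, 'Chang'), (3, 'Tofu')]
print(not_exists_rows)  # [(2, 'Chang'), (3, 'Tofu')]
```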
    
  • 2020-11-21 12:49

    I always default to NOT EXISTS.

    The execution plans may be the same at the moment, but if either column is altered in the future to allow NULLs, the NOT IN version will need to do more work (even if no NULLs are actually present in the data), and the semantics of NOT IN when NULLs are present are unlikely to be the ones you want anyway.

    When neither Products.ProductID nor [Order Details].ProductID allows NULLs, the NOT IN is treated identically to the following query.

    SELECT ProductID,
           ProductName
    FROM   Products p
    WHERE  NOT EXISTS (SELECT *
                       FROM   [Order Details] od
                       WHERE  p.ProductId = od.ProductId) 
    

    The exact plan may vary but for my example data I get the following.

    [Execution plan image: Neither NULL]

    A reasonably common misconception seems to be that correlated subqueries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (the subquery is evaluated row by row), but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops; they can also use hash or merge joins (as in this example).

    /*Not valid syntax but better reflects the plan*/ 
    SELECT p.ProductID,
           p.ProductName
    FROM   Products p
           LEFT ANTI SEMI JOIN [Order Details] od
             ON p.ProductId = od.ProductId 
    

    If [Order Details].ProductID is NULL-able the query then becomes

    SELECT ProductID,
           ProductName
    FROM   Products p
    WHERE  NOT EXISTS (SELECT *
                       FROM   [Order Details] od
                       WHERE  p.ProductId = od.ProductId)
           AND NOT EXISTS (SELECT *
                           FROM   [Order Details]
                           WHERE  ProductId IS NULL) 
    

    The reason for this is that the correct semantics, if [Order Details] contains any NULL ProductIds, is to return no results. The plan gains an extra anti semi join and a row count spool to verify this.

    [Execution plan image: One NULL]
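
    The claimed equivalence can be checked on toy data. The sketch below assumes SQLite as a stand-in for SQL Server and uses invented rows; `run_both` is a hypothetical helper that runs the NOT IN query and the two-branch NOT EXISTS rewrite against the same data.

```python
import sqlite3

NOT_IN = """
    SELECT ProductID FROM Products
    WHERE ProductID NOT IN (SELECT ProductID FROM [Order Details])
    ORDER BY ProductID"""

# The rewrite from the answer: no matching row AND no NULL anywhere in the inner column.
EXPANDED = """
    SELECT ProductID FROM Products p
    WHERE NOT EXISTS (SELECT * FROM [Order Details] od
                      WHERE p.ProductID = od.ProductID)
      AND NOT EXISTS (SELECT * FROM [Order Details]
                      WHERE ProductID IS NULL)
    ORDER BY ProductID"""


def run_both(order_product_ids):
    """Hypothetical helper: load sample rows, return (NOT IN rows, rewrite rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Products (ProductID INTEGER NOT NULL)")
    conn.execute("CREATE TABLE [Order Details] (ProductID INTEGER)")  # NULL-able
    conn.executemany("INSERT INTO Products VALUES (?)", [(1,), (2,), (3,)])
    conn.executemany("INSERT INTO [Order Details] VALUES (?)",
                     [(v,) for v in order_product_ids])
    result = (conn.execute(NOT_IN).fetchall(), conn.execute(EXPANDED).fetchall())
    conn.close()
    return result


without_null = run_both([1])        # no NULL in [Order Details].ProductID
with_null = run_both([1, None])     # a single NULL in the inner column

print(without_null)  # ([(2,), (3,)], [(2,), (3,)])
print(with_null)     # ([], [])
```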

    If Products.ProductID is also changed to become NULL-able the query then becomes

    SELECT ProductID,
           ProductName
    FROM   Products p
    WHERE  NOT EXISTS (SELECT *
                       FROM   [Order Details] od
                       WHERE  p.ProductId = od.ProductId)
           AND NOT EXISTS (SELECT *
                           FROM   [Order Details]
                           WHERE  ProductId IS NULL)
           AND NOT EXISTS (SELECT *
                           FROM   (SELECT TOP 1 *
                                   FROM   [Order Details]) S
                           WHERE  p.ProductID IS NULL) 
    

    The reason for this last condition is that a row with a NULL Products.ProductId should not be returned unless the NOT IN subquery returns no rows at all (i.e. the [Order Details] table is empty), in which case it should be returned. In the plan for my sample data this is implemented by adding another anti semi join, as below.

    [Execution plan image: Both NULL]

    The effect of this is shown in the blog post already linked by Buckley. In the example there, the number of logical reads increases from around 400 to 500,000.

    Additionally, the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but there are in fact no NULL rows in the data, then the rest of the execution plan may be catastrophically worse when this is just part of a larger query, for example with inappropriate nested loops causing repeated execution of an expensive subtree.
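
    The three-branch rewrite can be compared with NOT IN in the same way. This is a sketch under assumptions: SQLite stands in for SQL Server, so `TOP 1` is written as `LIMIT 1`, the rows are invented, and `compare` is a hypothetical helper.

```python
import sqlite3

NOT_IN = """
    SELECT ProductID FROM Products
    WHERE ProductID NOT IN (SELECT ProductID FROM [Order Details])
    ORDER BY ProductID"""

# Third branch: a NULL outer ProductID is rejected whenever [Order Details] has any row.
# SQLite has no TOP, so LIMIT 1 replaces SELECT TOP 1 from the answer.
EXPANDED = """
    SELECT ProductID FROM Products p
    WHERE NOT EXISTS (SELECT * FROM [Order Details] od
                      WHERE p.ProductID = od.ProductID)
      AND NOT EXISTS (SELECT * FROM [Order Details]
                      WHERE ProductID IS NULL)
      AND NOT EXISTS (SELECT * FROM (SELECT * FROM [Order Details] LIMIT 1) S
                      WHERE p.ProductID IS NULL)
    ORDER BY ProductID"""


def compare(order_product_ids):
    """Hypothetical helper: both columns NULL-able; return (NOT IN rows, rewrite rows)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Products (ProductID INTEGER)")
    conn.execute("CREATE TABLE [Order Details] (ProductID INTEGER)")
    conn.executemany("INSERT INTO Products VALUES (?)", [(1,), (2,), (None,)])
    conn.executemany("INSERT INTO [Order Details] VALUES (?)",
                     [(v,) for v in order_product_ids])
    result = (conn.execute(NOT_IN).fetchall(), conn.execute(EXPANDED).fetchall())
    conn.close()
    return result


empty_inner = compare([])          # empty subquery: every row, even the NULL one, qualifies
one_match = compare([1])           # NULL outer row drops out; only product 2 remains
inner_null = compare([1, None])    # a NULL in the inner column removes everything

print(empty_inner)  # ([(None,), (1,), (2,)], [(None,), (1,), (2,)])
print(one_match)    # ([(2,)], [(2,)])
print(inner_null)   # ([], [])
```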

    This is not the only possible execution plan for a NOT IN on a NULL-able column, however. This article shows another one for a query against the AdventureWorks2008 database.

    For the NOT IN on a NOT NULL column, or the NOT EXISTS against either a nullable or non-nullable column, it gives the following plan.

    [Execution plan image: Not Exists]

    When the column changes to NULL-able the NOT IN plan now looks like

    [Execution plan image: Not In - Null]

    It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.

    As this is under an anti semi join, if that one returns any rows the second seek will not occur. However, if Sales.SalesOrderDetail does not contain any NULL ProductIDs, it will double the number of seek operations required.

  • 2020-11-21 12:49

    I have a table with about 120,000 records and need to select only those that do not exist (matched on a varchar column) in four other tables with approximately 1500, 4000, 40000, and 200 rows. All the tables involved have a unique index on the varchar column concerned.

    NOT IN took about 10 minutes; NOT EXISTS took 4 seconds.

    I have a recursive query that might have had an untuned section contributing to the 10 minutes, but the other option taking 4 seconds shows, at least to me, that NOT EXISTS is far better, or at least that IN and EXISTS are not exactly the same and that it is always worth checking before going ahead with the code.
