Should the order of LINQ query clauses affect Entity Framework performance?


I'm using Entity Framework (code first) and finding that the order in which I specify clauses in my LINQ queries has a huge performance impact. For example:

    // (Reconstructed example: two orderings of the same Colour/Size filters)
    using (var context = new MyDbContext())
    {
        // One ordering of the two filters...
        var widgets = context.Widgets
            .Where(w => w.Colour == colour && w.Size == size)
            .ToList();

        // ...and the reverse ordering, which performs very differently.
        var widgetsReversed = context.Widgets
            .Where(w => w.Size == size && w.Colour == colour)
            .ToList();
    }
4 Answers
  • 2020-12-06 04:48

    The core of the question is not "why does the order matter with LINQ?". LINQ just translates literally without reordering. The real question is "why do the two SQL queries have different performance?".

    I was able to reproduce the problem by only inserting 100k rows. In that case a weakness in the optimizer is being triggered: it does not recognize that it can do a seek on Colour due to the complex condition. In the first query the optimizer does recognize the pattern and creates an index seek.

    There is no semantic reason why this should be. A seek on an index is possible even when seeking on NULL. This is a weakness/bug in the optimizer. Here are the two plans:

    (Screenshots of the two execution plans: an index seek for the first query, a scan for the second.)

    EF tries to be helpful here because it assumes that both the column and the filter variable can be null. In that case it tries to give you a match (which according to C# semantics is the right thing).
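    Concretely, a single equality comparison in the lambda expands into the null-compensating pattern quoted in the second answer below. A sketch of the shape (not EF's exact output; Widgets and colour are assumed names):

        // C# predicate:
        var q = context.Widgets.Where(w => w.Colour == colour);

        // Shape of the EF6 translation of that one comparison:
        //   ((Colour = @p) AND NOT (Colour IS NULL OR @p IS NULL))
        //   OR (Colour IS NULL AND @p IS NULL)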

    I tried undoing that by adding the following filter:

    Colour IS NOT NULL AND @p__linq__0 IS NOT NULL
    AND Size IS NOT NULL AND @p__linq__1 IS NOT NULL
    

    I hoped that the optimizer would use that knowledge to simplify the complex EF filter expression, but it did not manage to do so. If this had worked, the same filter could have been added to the EF query, providing an easy fix.
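    For reference, a sketch of what that guard would look like on the LINQ side (Widgets, colour and size are assumed names; the closure variables become the @p__linq__ parameters in SQL):

        var results = context.Widgets
            // Emitted as the IS NOT NULL predicates shown above
            .Where(w => w.Colour != null && colour != null
                     && w.Size != null && size != null)
            .Where(w => w.Colour == colour && w.Size == size)
            .ToList();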

    Here are the fixes that I recommend, in the order you should try them:

    1. Make the database columns not-null in the database
    2. Make the columns not-null in the EF data model, hoping that this will prevent EF from creating the complex filter condition
    3. Create indexes on (Colour, Size) and/or (Size, Colour). These also remove the problem.
    4. Ensure that the filtering is done in the right order, and leave a code comment explaining why
    5. Try using INTERSECT/Queryable.Intersect to combine the filters (see the sketch after this list). This often results in different plan shapes.
    6. Create an inline table-valued function that does the filtering. EF can use such a function as part of a bigger query.
    7. Drop down to raw SQL
    8. Use a plan guide to change the plan
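    A minimal sketch of option 5, assuming the Widgets set and filter variables from the question:

        // Each Where defines a separate set; Intersect asks for the rows
        // common to both, which EF sends to SQL Server as an INTERSECT
        // and often yields a different plan shape than one combined predicate.
        var byColour = context.Widgets.Where(w => w.Colour == colour);
        var bySize = context.Widgets.Where(w => w.Size == size);
        var results = byColour.Intersect(bySize).ToList();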

    All of these are workarounds, not root cause fixes.

    In the end I am not happy with either SQL Server or EF here. Both products should be fixed. Alas, they likely won't be, and you can't wait for that either.

    Here are the index scripts:

    CREATE NONCLUSTERED INDEX IX_Widget_Colour_Size ON dbo.Widget
        (Colour, Size)
        WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

    CREATE NONCLUSTERED INDEX IX_Widget_Size_Colour ON dbo.Widget
        (Size, Colour)
        WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    
  • 2020-12-06 05:06

    Note: I ran into this question long after others had already provided generally correct answers. I decided to post this as a separate answer only because I think the workaround can be helpful, and because you might appreciate better insight into why EF behaves this way.

    Short answer: The best workaround for this issue is to set this flag on your DbContext instance:

    context.Configuration.UseDatabaseNullSemantics = true;
    

    When you do this, all the extra null checks will go away and your queries should perform faster if they were affected by this issue.
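    For example, the flag can be set once in the context's constructor so that every query uses database null semantics (a sketch, assuming an EF6 code-first context class):

        public class MyDbContext : DbContext
        {
            public MyDbContext()
            {
                // Comparisons now translate to plain equality predicates
                // instead of the expanded null-compensation pattern; the
                // trade-off is that C#-style null matching is no longer
                // emulated in the generated SQL.
                Configuration.UseDatabaseNullSemantics = true;
            }
        }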

    Long answer: Others in this thread are right that in EF6 we introduced the extra null-checking terms by default to compensate for the difference between null comparison semantics in the database (three-valued logic) and standard in-memory null comparison semantics. The goal is to satisfy the following very popular request:

    Incorrect handling of null variables in 'where' clause

    Paul White is also right that in the following expression the 'AND NOT' part is less commonly needed when compensating for three-valued logic:

    ((x = y) AND NOT (x IS NULL OR y IS NULL)) OR (x IS NULL AND y IS NULL)
    

    That extra condition is necessary in the general case to prevent the result of the whole expression from being NULL. For example, assume that x = 1 and y = NULL. Then:

    (x = y) --> NULL 
    (x IS NULL AND y IS NULL) --> false
    NULL OR false --> NULL
    

    The distinction between NULL and false is important in case the comparison expression is negated at a later point in the composition of the query expression, e.g.:

    NOT (false) --> true 
    NOT (NULL) --> NULL
    

    It is also true that we could potentially add the smarts to EF to figure out when this extra term is unnecessary (e.g. if we know that the expression isn't negated in the predicate of the query) and to optimize it out of the query.
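    To make the negation case concrete, here is a sketch (Widgets and colour are assumed names):

        // Under C# semantics, a widget whose Colour is null SHOULD match this
        // filter whenever colour is non-null. Without the "AND NOT (...)" term,
        // the negated comparison would evaluate to NULL in SQL for that row,
        // and the row would be incorrectly filtered out.
        var notThatColour = context.Widgets
            .Where(w => w.Colour != colour)
            .ToList();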

    By the way, we are tracking this issue in the following EF bug on CodePlex:

    [Performance] Reduce the expression tree for complex queries in case of C# null comparison semantics

  • 2020-12-06 05:06

    Linq-to-SQL generates the equivalent SQL query for your Linq code, which means it filters in the same order you specify. It doesn't really have a way of knowing which ordering will be faster without running it to test.

    Either way round, your first filtering will be operating on the whole dataset, and will therefore be slow. However...

    • If you filter on the rare condition first, then it can cut the whole table down to a small set of results. Then your second filtering has only a small set to work on, which doesn't take long.
    • If you filter on the common condition first, then the set of data left afterwards is still quite large. The second filtering therefore operates on a large set of data, and therefore takes a little longer.

    So, rare first means slow + fast, while common first means slow + slow. The only way for Linq-to-SQL to optimise this distinction away for you would be to first run a query checking which of the two conditions is rarer, but then the generated SQL would either be different each time you ran it (and therefore couldn't be cached to speed it up) or would be significantly more complex than what you wrote in Linq (which the Linq-to-SQL designers didn't want, probably because it could make debugging a nightmare for the user).

    There's nothing to stop you from making this optimisation yourself, though: add a query beforehand to count which of the two filters will produce the smaller result set for the second filter to work on, as sketched below. For small databases this will be slower in almost every case, because you're making a whole extra query, but if your database is big enough and your check query is clever, it might end up faster on average. It may also be possible to work out how many rows matching condition A there would have to be for it to win regardless of how many condition B rows there are, and then just count condition A, which would make the check query faster.
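    A sketch of that manual optimisation, reusing the Colour/Size filters from the question (assumed names):

        // Probe each predicate's selectivity first (two extra COUNT queries),
        // then apply the more selective filter first.
        var colourCount = context.Widgets.Count(w => w.Colour == colour);
        var sizeCount = context.Widgets.Count(w => w.Size == size);

        var query = colourCount <= sizeCount
            ? context.Widgets.Where(w => w.Colour == colour).Where(w => w.Size == size)
            : context.Widgets.Where(w => w.Size == size).Where(w => w.Colour == colour);
        var results = query.ToList();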

  • 2020-12-06 05:08

    When tuning SQL queries, the order in which you filter your results certainly matters. Why would you expect Linq-to-SQL to be immune to the order of filtering?
