Entity Framework Include OrderBy random generates duplicate data

[愿得一人] 2020-11-30 02:29

When I retrieve a list of items from a database including some children (via .Include), and order them randomly, EF gives me an unexpected result. It creates/clones addition

6 answers
  • 2020-11-30 02:29

    I also ran into this problem, and solved it by adding a Randomizer Guid property to the main class I was fetching. I then set the column's default value to NEWID() like this (using EF Core 2):

    builder.Entity<MainClass>()
        .Property(m => m.Randomizer)
        .HasDefaultValueSql("NEWID()");
    

    When fetching, it gets a bit more complicated. I created two random integers to function as my order-by indexes, then ran the query like this:

    var rand = new Random();
    // A Guid without hyphens has 32 hex characters (indexes 0..31);
    // the upper bound of Random.Next is exclusive.
    var randomIndex1 = rand.Next(0, 32);
    var randomIndex2 = rand.Next(0, 32);
    var taskSet = await DbContext.MainClasses
        .Include(m => m.SubClass1)
            .ThenInclude(s => s.SubClass2)
        .OrderBy(m => m.Randomizer.ToString().Replace("-", "")[randomIndex1])
            .ThenBy(m => m.Randomizer.ToString().Replace("-", "")[randomIndex2])
        .FirstOrDefaultAsync();
    

    This seems to be working well enough, and should provide enough entropy for even a large dataset to be fairly randomized.

  • 2020-11-30 02:31

    tl;dr: There's a leaky abstraction here. To us, Include is a simple instruction to stick a collection of things onto each single returned Person row. But EF's implementation of Include is done by returning a whole row for each Person-Address combo, and reassembling at the client. Ordering by a volatile value causes those rows to become shuffled, breaking apart the Person groups that EF is relying on.


    When we have a look at ToTraceString() for this LINQ:

     var people = c.People.Include("Addresses");
     // Note: no OrderBy in sight!
    

    we see

    SELECT 
    [Project1].[Id] AS [Id], 
    [Project1].[Name] AS [Name], 
    [Project1].[C1] AS [C1], 
    [Project1].[Id1] AS [Id1], 
    [Project1].[Data] AS [Data], 
    [Project1].[PersonId] AS [PersonId]
    FROM ( SELECT 
        [Extent1].[Id] AS [Id], 
        [Extent1].[Name] AS [Name], 
        [Extent2].[Id] AS [Id1], 
        [Extent2].[PersonId] AS [PersonId], 
        [Extent2].[Data] AS [Data], 
        CASE WHEN ([Extent2].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]
        FROM  [Person] AS [Extent1]
        LEFT OUTER JOIN [Address] AS [Extent2] ON [Extent1].[Id] = [Extent2].[PersonId]
    )  AS [Project1]
    ORDER BY [Project1].[Id] ASC, [Project1].[C1] ASC
    

    So we get n rows for each P that has n As, plus 1 row for each P without any As.

    Adding an OrderBy clause, however, puts the thing-to-order-by at the start of the ordered columns:

    var people = c.People.Include("Addresses").OrderBy(p => Guid.NewGuid());
    

    gives

    SELECT 
    [Project1].[Id] AS [Id], 
    [Project1].[Name] AS [Name], 
    [Project1].[C2] AS [C1], 
    [Project1].[Id1] AS [Id1], 
    [Project1].[Data] AS [Data], 
    [Project1].[PersonId] AS [PersonId]
    FROM ( SELECT 
        NEWID() AS [C1], 
        [Extent1].[Id] AS [Id], 
        [Extent1].[Name] AS [Name], 
        [Extent2].[Id] AS [Id1], 
        [Extent2].[PersonId] AS [PersonId], 
        [Extent2].[Data] AS [Data], 
        CASE WHEN ([Extent2].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C2]
        FROM  [Person] AS [Extent1]
        LEFT OUTER JOIN [Address] AS [Extent2] ON [Extent1].[Id] = [Extent2].[PersonId]
    )  AS [Project1]
    ORDER BY [Project1].[C1] ASC, [Project1].[Id] ASC, [Project1].[C2] ASC
    

    So in your case, where the ordered-by-thing is not a property of a P, but is instead volatile, and therefore can be different for different P-A records of the same P, the whole thing falls apart.
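
    To make the failure mode concrete, here is a minimal in-memory sketch. It is not EF's actual materialization code: Row and Materialize are hypothetical stand-ins that assume the client starts a new Person whenever the Person Id changes from one row to the next.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IncludeShuffleDemo
{
    // One flattened row of the Person LEFT JOIN Address result (hypothetical type).
    public record Row(int PersonId, int? AddressId);

    // Assumed reassembly rule: start a new Person whenever PersonId
    // differs from the previous row's PersonId.
    public static List<(int PersonId, List<int> AddressIds)> Materialize(IEnumerable<Row> rows)
    {
        var people = new List<(int PersonId, List<int> AddressIds)>();
        foreach (var row in rows)
        {
            if (people.Count == 0 || people[^1].PersonId != row.PersonId)
                people.Add((row.PersonId, new List<int>()));
            if (row.AddressId is int id)
                people[^1].AddressIds.Add(id);
        }
        return people;
    }

    public static void Main()
    {
        var rows = new List<Row>
        {
            new(1, 10), new(1, 11), new(2, 12), new(3, 13), new(3, 14),
        };

        // Rows sorted by PersonId: each person's rows are adjacent.
        Console.WriteLine(Materialize(rows.OrderBy(r => r.PersonId)).Count); // 3

        // Rows interleaved, as a per-row volatile sort key could leave them:
        // Person 1 is split around Person 2 and shows up twice.
        var interleaved = new[] { rows[0], rows[2], rows[1], rows[3], rows[4] };
        Console.WriteLine(Materialize(interleaved).Count); // 4
    }
}
```

    Under that assumption, three people come back as four objects as soon as one person's rows stop being adjacent, which matches the duplicates described in the question.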


    I'm not sure where on the working-as-intended ~~~ cast-iron bug continuum this behaviour falls. But at least now we know about it.

  • 2020-11-30 02:32

    I don't think there is an issue in query generation, but there is definitely an issue when EF tries to convert the rows into objects.

    There seems to be an inherent assumption here that the rows for the same person in a joined result set will be returned grouped together, whether or not there is an ORDER BY.

    For example, the result of a joined query will always be:

    P.Id  P.Name    A.Id  A.StreetLine
    1     Person 1  10    ---
    1     Person 1  11
    2     Person 2  12
    3     Person 3  13
    3     Person 3  14
    

    Even if you order by some other column, the rows for the same person would always appear one after the other.

    This assumption is mostly true for any joined query.

    But there is a deeper issue here, I think. OrderBy is meant for putting data in a definite order (as opposed to a random one), so that assumption does seem reasonable.

    I think you should really fetch the data first and then randomize it by some other means in your code.
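
    One way to do that is a Fisher-Yates shuffle on the materialized list. This is only a sketch: ShuffleInPlace is a hypothetical helper name, not part of EF or the BCL.

```csharp
using System;
using System.Collections.Generic;

public static class ShuffleExtensions
{
    // Fisher-Yates shuffle: walk from the end and swap each element
    // with a randomly chosen element at or before it.
    public static void ShuffleInPlace<T>(this IList<T> items, Random rng)
    {
        for (int i = items.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // 0 <= j <= i (upper bound is exclusive)
            (items[i], items[j]) = (items[j], items[i]);
        }
    }
}
```

    With the question's model this would be used as db.Persons.Include(p => p.Addresses).ToList() followed by ShuffleInPlace(new Random()) on the resulting list. The trade-off is that the whole result set is loaded before shuffling, so it suits small or already filtered sets.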

  • 2020-11-30 02:36

    From theory: to sort a list of items, the compare function must be stable with respect to the items; that is, for any two items x and y, the result of x < y must be the same no matter how many times it is evaluated.

    I think the issue comes from a misunderstanding of the specification (documentation) of the OrderBy method: keySelector - A function to extract a key from an element.

    The documentation doesn't state explicitly that the provided function must return the same value for the same object every time it is called (in your case it returns a different, random value each time), but I think the term "key" implicitly suggests this.

  • 2020-11-30 02:50

    When you define a query path to define the query results, (use Include), the query path is only valid on the returned instance of ObjectQuery. Other instances of ObjectQuery and the object context itself are not affected. This functionality lets you chain multiple "Includes" for eager loading.

    Therefore, your statement translates into

    from person in db.Persons.Include(p => p.Addresses).OrderBy(p => Guid.NewGuid())
    select person
    

    instead of what you intended:

    (from person in db.Persons.Include(p => p.Addresses)
     select person)
    .OrderBy(p => Guid.NewGuid())
    

    Hence your second workaround works fine :)

    Reference: Loading Related Objects While Querying A Conceptual Model in Entity Framework - http://msdn.microsoft.com/en-us/library/bb896272.aspx

  • 2020-11-30 02:53

    As one can work out from AakashM's answer and Nicolae Dascalu's answer, it strongly seems that LINQ OrderBy requires a stable ranking function, which NEWID/Guid.NewGuid is not.

    So we have to use another random generator that is stable within a single query.

    To achieve this, before each query, use a .NET Random generator to get a random number. Then combine this random number with a unique property of the entity to be randomly sorted. To 'randomize' the result a bit more, checksum it. (CHECKSUM is a SQL Server function that computes a hash; the original idea was found on this blog.)

    Assuming Person Id is an int, you could write your query this way:

    var rnd = (new Random()).NextDouble();
    var persons = db.Persons
        .Include(p => p.Addresses)
        .OrderBy(p => SqlFunctions.Checksum(p.Id * rnd));
    

    Like the NewGuid hack, this is very probably not a good random generator with a good distribution and so on. But it does not cause entities to be duplicated in the results.

    Beware:
    If your query ordering does not guarantee a unique ranking for your entities, you must complete it so that it does. For example, if you use a non-unique property of your entities in the checksum call, add something like .ThenBy(p => p.Id) after the OrderBy.
    If the ranking is not unique for your queried root entity, its included children may get mixed with the children of other entities that have the same ranking, and the bug will remain.

    Note:
    I would prefer to use the .Next() method to get an int and combine it via xor (^) with a unique int property of the entity, rather than using a double and multiplying. Unfortunately, SqlFunctions.Checksum does not provide an overload for the int data type, although the SQL Server function is supposed to support it. You could use a cast to work around this, but to keep it simple I chose to go with the multiplication.
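
    The principle (one seed per query, a key that depends only on the entity's unique id) can be illustrated in memory. This is a sketch, not something EF can translate to SQL: StableKey is a hypothetical helper, and the bit-mixing steps are a swapped-in integer hash standing in for CHECKSUM.

```csharp
using System;
using System.Linq;

public static class StableKeyDemo
{
    // Hypothetical key function: xor the id with a per-query seed, then
    // mix the bits. Every step is invertible, so distinct ids always get
    // distinct keys, and the same id always gets the same key for a given seed.
    public static int StableKey(int id, int seed)
    {
        unchecked
        {
            uint x = (uint)(id ^ seed);
            x ^= x >> 16;
            x *= 0x7feb352dU;
            x ^= x >> 15;
            x *= 0x846ca68bU;
            x ^= x >> 16;
            return (int)x;
        }
    }

    // Counts runs of adjacent rows that share the same PersonId.
    public static int CountGroups((int PersonId, int AddressId)[] rows)
    {
        int groups = 0;
        for (int i = 0; i < rows.Length; i++)
            if (i == 0 || rows[i].PersonId != rows[i - 1].PersonId)
                groups++;
        return groups;
    }

    public static void Main()
    {
        int seed = new Random().Next(); // chosen once per "query"
        (int PersonId, int AddressId)[] rows =
        {
            (1, 10), (1, 11), (2, 12), (3, 13), (3, 14),
        };

        var ordered = rows.OrderBy(r => StableKey(r.PersonId, seed)).ToArray();

        // The order of the people varies with the seed, but each person's
        // rows stay adjacent, so there are always exactly 3 groups.
        Console.WriteLine(CountGroups(ordered)); // 3
    }
}
```

    Because the key is unique per id, this also satisfies the uniqueness requirement from the "Beware" paragraph without an extra ThenBy.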
