When I retrieve a list of items from a database including some children (via .Include), and order them randomly, EF gives me an unexpected result. It creates/clones addition
I also ran into this problem, and solved it by adding a Randomizer Guid property to the main class I was fetching. I then set the column's default value to NEWID() like this (using EF Core 2)
builder.Entity<MainClass>()
    .Property(m => m.Randomizer)
    .HasDefaultValueSql("NEWID()");
When fetching, it gets a bit more complicated. I created two random integers to function as my order-by indexes, then ran the query like this
var rand = new Random();
var randomIndex1 = rand.Next(0, 31);
var randomIndex2 = rand.Next(0, 31);
var taskSet = await DbContext.MainClasses
    .Include(m => m.SubClass1)
        .ThenInclude(s => s.SubClass2)
    .OrderBy(m => m.Randomizer.ToString().Replace("-", "")[randomIndex1])
    .ThenBy(m => m.Randomizer.ToString().Replace("-", "")[randomIndex2])
    .FirstOrDefaultAsync();
This seems to be working well enough, and should provide enough entropy for even a large dataset to be fairly randomized.
tl;dr: There's a leaky abstraction here. To us, Include is a simple instruction to stick a collection of things onto each single returned Person row. But EF's implementation of Include is done by returning a whole row for each Person-Address combo, and reassembling at the client. Ordering by a volatile value causes those rows to become shuffled, breaking apart the Person groups that EF is relying on.
When we have a look at ToTraceString() for this LINQ:
var people = c.People.Include("Addresses");
// Note: no OrderBy in sight!
we see
SELECT
[Project1].[Id] AS [Id],
[Project1].[Name] AS [Name],
[Project1].[C1] AS [C1],
[Project1].[Id1] AS [Id1],
[Project1].[Data] AS [Data],
[Project1].[PersonId] AS [PersonId]
FROM ( SELECT
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent2].[Id] AS [Id1],
[Extent2].[PersonId] AS [PersonId],
[Extent2].[Data] AS [Data],
CASE WHEN ([Extent2].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]
FROM [Person] AS [Extent1]
LEFT OUTER JOIN [Address] AS [Extent2] ON [Extent1].[Id] = [Extent2].[PersonId]
) AS [Project1]
ORDER BY [Project1].[Id] ASC, [Project1].[C1] ASC
So we get one row for each A (n rows for a P with n As), plus 1 row for each P without any As.
Adding an OrderBy clause, however, puts the thing-to-order-by at the start of the ordered columns:
var people = c.People.Include("Addresses").OrderBy(p => Guid.NewGuid());
gives
SELECT
[Project1].[Id] AS [Id],
[Project1].[Name] AS [Name],
[Project1].[C2] AS [C1],
[Project1].[Id1] AS [Id1],
[Project1].[Data] AS [Data],
[Project1].[PersonId] AS [PersonId]
FROM ( SELECT
NEWID() AS [C1],
[Extent1].[Id] AS [Id],
[Extent1].[Name] AS [Name],
[Extent2].[Id] AS [Id1],
[Extent2].[PersonId] AS [PersonId],
[Extent2].[Data] AS [Data],
CASE WHEN ([Extent2].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C2]
FROM [Person] AS [Extent1]
LEFT OUTER JOIN [Address] AS [Extent2] ON [Extent1].[Id] = [Extent2].[PersonId]
) AS [Project1]
ORDER BY [Project1].[C1] ASC, [Project1].[Id] ASC, [Project1].[C2] ASC
So in your case, where the ordered-by-thing is not a property of a P, but is instead volatile, and therefore can be different for different P-A records of the same P, the whole thing falls apart.
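To make the failure concrete, here is a minimal in-memory sketch. It is not EF's materializer: Row, CountGroups and the sample data are hypothetical, and the only thing taken from the explanation above is the assumption that the client treats each run of adjacent rows with the same person Id as one Person.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class IncludeShuffleSketch
{
    // One flattened Person-Address row, as the LEFT OUTER JOIN returns it.
    public sealed class Row
    {
        public int PersonId;
        public int AddressId;
        public Row(int personId, int addressId) { PersonId = personId; AddressId = addressId; }
    }

    // Hypothetical stand-in for the client-side reassembly: every run of
    // adjacent rows with the same PersonId is counted as one Person.
    public static int CountGroups(IEnumerable<Row> rows)
    {
        int groups = 0;
        int? current = null;
        foreach (var row in rows)
        {
            if (current != row.PersonId) { groups++; current = row.PersonId; }
        }
        return groups;
    }

    // Three people: two with two addresses each, one with a single address.
    public static List<Row> SampleRows()
    {
        return new List<Row>
        {
            new Row(1, 10), new Row(1, 11),
            new Row(2, 12),
            new Row(3, 13), new Row(3, 14),
        };
    }

    public static void Main()
    {
        var rows = SampleRows();
        // Ordered by PersonId: three runs, so three people.
        Console.WriteLine(CountGroups(rows.OrderBy(r => r.PersonId)));
        // A fresh key per row (like NEWID() per joined row) may interleave
        // people, so the number of runs can exceed the number of people.
        Console.WriteLine(CountGroups(rows.OrderBy(r => Guid.NewGuid())));
    }
}
```

The second count varies from run to run; whenever it is larger than 3, the same person has been split into several groups, which is the duplication seen in the question.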
I'm not sure where on the working-as-intended ~~~ cast-iron bug continuum this behaviour falls. But at least now we know about it.
I don't think there is an issue in query generation, but there is definitely an issue when EF tries to convert rows into objects.
It looks like there is an inherent assumption here that the rows for the same person in a joined statement will be returned grouped together, with or without an order by.
For example, the result of a joined query will always be
P.Id  P.Name    A.Id  A.StreetLine
1     Person 1  10    ---
1     Person 1  11
2     Person 2  12
3     Person 3  13
3     Person 3  14
Even if you order by some other column, the rows for the same person would always appear one after the other.
This assumption is mostly true for any joined query.
But there is a deeper issue here, I think. OrderBy is for when you want data in a certain order (as opposed to random), so that assumption does seem reasonable.
I think you should really get the data out and then randomize it by some other means in your code.
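One way to do that, as a minimal sketch: materialize the query first, then shuffle the resulting list on the client with a Fisher-Yates shuffle. The Shuffle helper and the sample list are hypothetical; in real code the list would come from something like the Include query followed by ToList().

```csharp
using System;
using System.Collections.Generic;

public static class ShuffleSketch
{
    // Fisher-Yates shuffle: reorders the list in place.
    public static void Shuffle<T>(IList<T> items, Random rng)
    {
        for (int i = items.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // 0 <= j <= i
            T tmp = items[i];
            items[i] = items[j];
            items[j] = tmp;
        }
    }

    public static void Main()
    {
        // Stand-in for the already materialized query result.
        var people = new List<string> { "Person 1", "Person 2", "Person 3" };
        Shuffle(people, new Random());
        Console.WriteLine(string.Join(", ", people));
    }
}
```

Because the shuffle happens after EF has attached the children to their parents, the ordering of the joined rows is never disturbed. The cost is that the whole result set has to be loaded before it can be randomized.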
From theory: to sort a list of items, the compare function should be stable with respect to the items; this means that for any 2 items x, y the result of x < y should be the same no matter how many times it is evaluated.
I think the issue is related to a misunderstanding of the specification (documentation) of the OrderBy method: keySelector - A function to extract a key from an element.
The documentation doesn't say explicitly whether the provided function must return the same value for the same object every time it is called (in your case it returns different/random values), but I think the term "key" used there implicitly suggests this.
When you define a query path to define the query results, (use Include), the query path is only valid on the returned instance of ObjectQuery. Other instances of ObjectQuery and the object context itself are not affected. This functionality lets you chain multiple "Includes" for eager loading.
Therefore, your statement translates into
from person in db.Persons.Include(p => p.Addresses).OrderBy(p => Guid.NewGuid())
select person
instead of what you intended:
(from person in db.Persons.Include(p => p.Addresses)
 select person)
.OrderBy(p => Guid.NewGuid())
Hence your second workaround works fine :)
Reference: Loading Related Objects While Querying A Conceptual Model in Entity Framework - http://msdn.microsoft.com/en-us/library/bb896272.aspx
As one can work out by reading AakashM's answer and Nicolae Dascalu's answer, it strongly seems that Linq OrderBy requires a stable ranking function, which NewID/Guid.NewGuid is not.
So we have to use another random generator that is stable inside a single query.
To achieve this, before each query, use a .NET Random generator to get a random number. Then combine this random number with a unique property of the entity to be randomly sorted. To 'randomize' the result a bit further, checksum it. (checksum is a SQL Server function that computes a hash; the original idea was found on this blog.)
Assuming Person Id is an int, you could write your query this way:
var rnd = (new Random()).NextDouble();
var persons = db.Persons
    .Include(p => p.Addresses)
    .OrderBy(p => SqlFunctions.Checksum(p.Id * rnd));
Like the NewGuid hack, this is very probably not a good random generator with a good distribution and so on. But it does not cause entities to be duplicated in the results.
Beware:
If your query ordering does not guarantee a unique ranking for your entities, you must complement it so that it does. For example, if you use a non-unique property of your entities for the checksum call, then add something like .ThenBy(p => p.Id) after the OrderBy.
If the ranking is not unique for your queried root entity, its included children may get mixed with the children of other entities having the same ranking. And then the bug will still be there.
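The rule above can be checked in memory with a small sketch. Everything here is hypothetical and only stands in for the SQL: Key replaces CHECKSUM(Id * rnd) with an arbitrary function of the person Id and a per-query seed, and the input is just the person Id of each joined row.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StableKeySketch
{
    // Hypothetical stand-in for the checksum: depends only on the person Id
    // and the per-query seed, so every row of one person gets the same key.
    public static int Key(int personId, int seed)
    {
        unchecked { return (personId ^ seed) * 486187739; }
    }

    // Orders the joined rows (represented by their person Id) by the key,
    // then by the Id itself so that the ranking is unique per person.
    public static List<int> OrderedPersonIds(IEnumerable<int> personIdPerRow, int seed)
    {
        return personIdPerRow
            .OrderBy(id => Key(id, seed))
            .ThenBy(id => id)
            .ToList();
    }

    // True when all rows of each person are adjacent.
    public static bool IsContiguous(IList<int> ids)
    {
        var seen = new HashSet<int>();
        for (int i = 0; i < ids.Count; i++)
        {
            if (i > 0 && ids[i] == ids[i - 1]) continue; // same run
            if (!seen.Add(ids[i])) return false;         // person seen in an earlier run
        }
        return true;
    }

    public static void Main()
    {
        var rows = new[] { 1, 1, 2, 3, 3 };
        int seed = new Random().Next();
        var ordered = OrderedPersonIds(rows, seed);
        Console.WriteLine(string.Join(" ", ordered));
        Console.WriteLine(IsContiguous(ordered));
    }
}
```

The order of the people changes with the seed, but the rows of each person stay together, which is the property the reassembly at the client relies on.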
Note:
I would prefer to use the .Next() method to get an int and combine it through a xor (^) with a unique int property of the entity, rather than using a double and multiplying. Unfortunately, SqlFunctions.Checksum does not provide an overload for the int data type, although the SQL Server function is supposed to support it. You may use a cast to overcome this, but to keep it simple I chose to go with the multiplication.