I am working with a very large data set, roughly 2 million records. I have the code below, but I get an out of memory exception after it has processed around three batches, about 600,000 records.
The issue is that when you get data from EF there are actually two copies of the data created: one which is returned to the user, and a second which EF holds onto and uses for change detection (so that it can persist changes to the database). EF holds this second set for the lifetime of the context, and it's this set that's running you out of memory.
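A quick way to see this for yourself (a minimal sketch, assuming a context with a Towns DbSet; MyDbContext is an illustrative name, not from the original question): every entity materialised by a tracked query adds an entry to the context's change tracker, so memory grows with the number of rows read rather than with what you keep.

using (var dbContext = new MyDbContext())            // hypothetical context type
{
    var towns = dbContext.Towns.OrderBy(t => t.TownID).Take(1000).ToList();

    // Roughly 1000 tracked entries now live for the lifetime of the context,
    // on top of the list you are holding yourself.
    Console.WriteLine(dbContext.ChangeTracker.Entries().Count());
}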
You have two options to deal with this:
Use .AsNoTracking() in your query, e.g.:
IEnumerable<IEnumerable<Town>> towns = dbContext.Towns.AsNoTracking().OrderBy(t => t.TownID).Batch(200000);
This tells EF not to keep a copy for change detection. You can read a little more about what AsNoTracking does and its performance impact on my blog: http://blog.staticvoid.co.nz/2012/4/2/entity_framework_and_asnotracking
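A common companion to this, and what the next answer refers to as renewing the connection for each batch, is to use a short-lived context per batch so nothing accumulates across batches. A minimal sketch combining both ideas (MyDbContext and ProcessTown are illustrative names, not from the original post), paging with Skip/Take and a fresh context per batch:

const int batchSize = 200000;
for (int skip = 0; ; skip += batchSize)
{
    List<Town> batch;
    using (var dbContext = new MyDbContext())          // fresh context per batch
    {
        batch = dbContext.Towns
            .AsNoTracking()                             // no change-tracking copy
            .OrderBy(t => t.TownID)                     // stable ordering for paging
            .Skip(skip)
            .Take(batchSize)
            .ToList();
    }                                                   // context released here, along with any tracked state

    if (batch.Count == 0)
        break;                                          // no more rows

    foreach (var town in batch)
        ProcessTown(town);                              // hypothetical per-row work
}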
I wrote a migration routine that reads from one DB and writes (with minor changes in layout) into another DB of a different type, and in this case renewing the connection for each batch and using AsNoTracking() did not cut it for me.
Note that this problem occurs using a '97 version of JET. It may work flawlessly with other DBs.
However, the following algorithm did solve the out-of-memory issue: every 50 rows or so written/updated, check the memory usage and, as needed, recover memory and reset the output DB context (and connected tables):
var before = System.Diagnostics.Process.GetCurrentProcess().VirtualMemorySize64;
if (before > 800000000)                // roughly 800 MB of virtual memory in use
{
    // Flush pending changes, then throw away the context (and its change tracker).
    dbcontextOut.SaveChanges();
    dbcontextOut.Dispose();

    // Force the released objects to be collected before carrying on.
    GC.Collect();
    GC.WaitForPendingFinalizers();

    // Recreate the output context and re-resolve the output table by name.
    dbcontextOut = dbcontextOutFunc();
    tableOut = Dynamic.InvokeGet(dbcontextOut, outputTableName);
}
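For context, here is a minimal sketch (the surrounding loop is not part of the original answer; sourceRows and CopyRow are illustrative names) of how such a check can sit inside a row-by-row migration loop, running every 50 rows or so as described above:

var dbcontextOut = dbcontextOutFunc();                        // factory producing a fresh output context
dynamic tableOut = Dynamic.InvokeGet(dbcontextOut, outputTableName);

int rowCount = 0;
foreach (var sourceRow in sourceRows)                         // rows streamed from the input DB
{
    CopyRow(sourceRow, tableOut);                             // hypothetical per-row copy/transform

    if (++rowCount % 50 == 0)
    {
        // Run the memory check / context reset shown above here.
    }
}

dbcontextOut.SaveChanges();                                   // persist whatever is left
dbcontextOut.Dispose();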