Edit 3:
I created a much shorter example of the memory leak. I hope it makes it much easier to reason about what's going on. As the iterations proceed, you see ste
UPDATE - Now fixed in v1.8.11. From NEWS :
Long outstanding (usually small) memory leak in grouping fixed. When the last group is smaller than the largest group, the difference in those sizes was not being released. Also in non-trivial aggregations where each group returns a different number of rows. Most users run a grouping query once and will never have noticed, but anyone looping calls to grouping (such as when running in parallel) may have suffered, #2648. Tests added.
Many thanks to vc273, Y T and others.
The particular (great) example at the top of this question is considered a "non-trivial" aggregation, where the result of each group can be a different number of rows, not just a single aggregate in one row. Adding verbose=TRUE reveals:
Wrote less rows (4000000) than allocated (4488000).
and that's where the leak was in this case. It only matters if you need to repeat grouping many times, as is sometimes needed. The result was correct.
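To make "non-trivial aggregation" concrete, here is a minimal sketch on a made-up table. The table and the column names g and x are illustrative and not taken from the question. Each group returns a different number of rows, and verbose=TRUE prints the grouping diagnostics; the exact messages depend on the data.table version, so the line quoted above may not appear on a fixed release.

```r
library(data.table)

# Toy data (hypothetical): three groups of sizes 2, 5 and 3
DT <- data.table(g = rep(1:3, times = c(2, 5, 3)), x = 1:10)

# Non-trivial aggregation: each group returns only its even values of x,
# so the number of result rows differs from group to group
res <- DT[, .(x = x[x %% 2 == 0]), by = g, verbose = TRUE]

print(res)
```

Running a query like this once is harmless even on an affected version; the leak only accumulated when such a call was repeated many times.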
Previous answer retained for posterity ...
Consider this part :
#now add many columns
for (i in 1:100){
DT[[sprintf('col%s',i)]] = 1:nrow(DT);
}
That isn't using := or set(), which are the data.table-provided ways of adding columns by reference. = is the same as <-; i.e., on each and every iteration of this for loop the entire DT will be copied to make room for the single extra column. The memory leak you describe would be consistent with this for loop.
Some options are:
- cbind
- := e.g. DT[,sprintf('col%s',1:100):=1:nrow(DT)]
- a for loop, but use := or set() on each iteration

I haven't actually run your code to check, so there may be other problems later as well.
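As a rough sketch of the := and set() options, on a small made-up table rather than the one from the question (the id column and the column counts are illustrative assumptions):

```r
library(data.table)

# Toy table (hypothetical); the question's DT is much larger
DT <- data.table(id = 1:5)
addr_before <- address(DT)

# One := call adding several columns by reference
cols <- sprintf("col%s", 1:3)
DT[, (cols) := 1:nrow(DT)]

# A for loop that adds one column per iteration with set()
for (i in 4:6) {
  set(DT, j = sprintf("col%s", i), value = seq_len(nrow(DT)))
}

# The two addresses are expected to match when DT was modified in place
addr_after <- address(DT)
print(identical(addr_before, addr_after))
print(names(DT))
```

Comparing address(DT) before and after is one way to check whether the table was copied along the way; the same check around the original DT[[...]] = ... loop would be a useful comparison.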
UPDATE : I have now run your code and I think I might be able to guess what you mean about memory use. But guessing can use up a lot of time, especially in areas like this. Can you please expand significantly upon this :
I see a steadily increasing memory use, which seems like a memory leak.
What precisely do you see; i.e., what are the numbers? What does it start at and what does it end at? How many times did you run it? Please also provide the output of sessionInfo(); although you give the version of R (2.13.0), which is helpful, it helps to know if you are on 32bit or 64bit Linux, Mac or Windows as well.
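One way to collect such numbers is sketched below. It assumes base R's gc() report is an acceptable measure of memory in use; the helper name mem_used_mb, the toy table and the iteration count are all made up for illustration.

```r
library(data.table)

# Hypothetical helper: total Mb in use (Ncells + Vcells) after a garbage collection
mem_used_mb <- function() sum(gc()[, 2])

# Toy data (illustrative only): 100 groups with varying sizes
DT <- data.table(g = sample(1:100, 1e5, replace = TRUE), x = runif(1e5))

# Repeat the grouping and record memory use after each run
readings <- numeric(5)
for (i in seq_along(readings)) {
  res <- DT[, .(x = x[x > 0.5]), by = g]
  readings[i] <- mem_used_mb()
}

print(readings)   # first and last values give the "start" and "end" figures
sessionInfo()     # R version, platform (32bit/64bit, OS) and package versions
```

Posting the readings together with the sessionInfo() output gives concrete figures for where memory use starts, where it ends, and over how many runs.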