Slow memory leak in data.table when returning named lists in j (trying to reshape a data.table)

天涯浪人 2021-01-11 12:58

Edit 3:

I created a much shorter example of the memory leak. I hope it makes it much easier to reason about what's going on. As the iterations proceed, you see ste

1 answer
  • 2021-01-11 13:46

    UPDATE - Now fixed in v1.8.11. From NEWS :

    Long outstanding (usually small) memory leak in grouping fixed. When the last group is smaller than the largest group, the difference in those sizes was not being released. Also in non-trivial aggregations where each group returns a different number of rows. Most users run a grouping query once and will never have noticed, but anyone looping calls to grouping (such as when running in parallel) may have suffered, #2648. Tests added.

    Many thanks to vc273, Y T and others.


    The particular (great) example at the top of this question is considered a "non-trivial" aggregation, where the result of each group can be a different number of rows rather than a single aggregate in one row. Adding verbose=TRUE reveals :

    Wrote less rows (4000000) than allocated (4488000).

    and that's where the leak was in this case. It only matters if you need to repeat the grouping many times, as is sometimes needed. The result was correct.
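For anyone wanting to see this kind of diagnostic on their own query, here is a minimal sketch of a "non-trivial" aggregation run with verbose=TRUE. The table DT and its columns g and v are made up for illustration and are not the data from the question; the exact wording of the verbose messages depends on the data.table version, so no output is shown.

```r
library(data.table)

# Hypothetical toy table: three groups of sizes 4, 2 and 1
DT <- data.table(g = rep(1:3, times = c(4, 2, 1)), v = 1:7)

# Each group returns at most 2 rows, so the number of rows returned
# differs from the number of rows in the group.
# verbose = TRUE asks data.table to print its internal diagnostics.
res <- DT[, list(v = head(v, 2L)), by = g, verbose = TRUE]

print(res)
```

Groups 1 and 2 contribute two rows each and group 3 contributes one, so the result has five rows.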


    Previous answer retained for posterity ...

    Consider this part :

    #now add many columns
    for (i in 1:100){
        DT[[sprintf('col%s',i)]] = 1:nrow(DT);
    }
    

    That isn't using := or set(), which are the ways data.table provides for adding columns by reference. Here = is the same as <-; i.e., on each and every iteration of this for loop the entire DT will be copied to make room for the single extra column. The memory leak you describe would be consistent with this for loop.

    Some options are :

    • Add the many columns in one go using cbind
    • Add the columns in one go using := e.g. DT[,sprintf('col%s',1:100):=1:nrow(DT)]
    • Keep the for loop but use := or set() on each iteration
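The last two options might look something like the sketch below. It is only an illustration, not your actual code: the small tables DT and DT2, the id column and the cols vector are made up for the example.

```r
library(data.table)

# Hypothetical small table standing in for the real DT
DT <- data.table(id = 1:5)
cols <- sprintf("col%s", 1:100)

# Option: add all 100 columns in one go with :=
# Wrapping cols in parentheses makes data.table use its value as the
# column names; the single right-hand-side vector is reused for each column.
DT[, (cols) := seq_len(nrow(DT))]

# Option: keep the for loop, but add each column by reference with set()
DT2 <- data.table(id = 1:5)
for (col in cols) {
  set(DT2, j = col, value = seq_len(nrow(DT2)))
}
```

Both forms modify the table in place, so nothing needs to be assigned back to DT or DT2.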

    I haven't actually run your code to check so there may be other problems later as well.


    UPDATE : I have now run your code and I think I might be able to guess what you mean about memory use. But guessing can use up a lot of time, especially in areas like this. Can you please expand significantly upon this :

    I see a steadily increasing memory use, which seems like a memory leak.

    What precisely do you see; i.e., what are the numbers? What does it start at and what does it end at? How many times did you run it? Please also provide the output of sessionInfo(); although you give the version of R (2.13.0), which is helpful, it also helps to know whether you are on 32-bit or 64-bit, and on Linux, Mac or Windows.
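One possible way to collect those numbers is sketched below, assuming base R's gc() is an acceptable measure. The table DT, the grouping query and the iteration count n_iter are placeholders, not your code. Note that gc() only reports memory managed by R, so the process size reported by the operating system may be worth recording as well.

```r
library(data.table)

# Placeholder data and query; substitute the real ones
DT <- data.table(g = sample(10L, 1e4, replace = TRUE), v = runif(1e4))

n_iter <- 5L
mem_mb <- numeric(n_iter)

for (k in seq_len(n_iter)) {
  res <- DT[, list(v = head(v, 2L)), by = g]
  # gc() runs a collection and returns a matrix; column 2 is the memory
  # currently in use, in Mb, for the Ncells and Vcells rows
  mem_mb[k] <- sum(gc()[, 2L])
}

print(mem_mb)         # memory in use after each iteration
print(sessionInfo())  # R version, platform and loaded package versions
```

Posting the first and last values of mem_mb, along with the number of iterations, would answer the questions above.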
