Question
I am making various ggplots on a very large dataset (much larger than the example below). I wrote a binning function over both the x- and y-axes to make plotting such a large dataset feasible.

In the following example, memory.size() is recorded at the start. The large dataset is then simulated as dt, and dt's x2 is plotted against x1 with binning. The plotting is repeated with different subsets of dt. The size of the resulting plot objects is checked with object.size() and stored. After the plot objects have been created, rm(dt) is executed, followed by a double gc(). At this point memory.size() is recorded again, and finally the memory.size() at the end is compared with the one at the beginning and printed.

Given the small size of the plot objects, I expected the memory.size() at the end to be similar to the one at the beginning. But it is not: memory.size() does not go back down until I restart the R session.
REPRODUCIBLE EXAMPLE
library(data.table)
library(ggplot2)
library(magrittr)
# The binning function
# dt    = a data.table
# x     = column name for the x-axis (character)
# y     = column name for the y-axis (character)
# xNItv = number of bins on the x-axis
# yNItv = number of bins on the y-axis
# Value:  a binned data.table with one row per (x, y) bin
tab_by_bin_idxy <- function(dt, x, y, xNItv, yNItv) {
  # Binning
  xBreaks = dt[, seq(min(get(x), na.rm = T), max(get(x), na.rm = T), length.out = xNItv + 1)]
  yBreaks = dt[, seq(min(get(y), na.rm = T), max(get(y), na.rm = T), length.out = yNItv + 1)]
  xbinCode = dt[, .bincode(get(x), breaks = xBreaks, include.lowest = T)]
  xbinMid = sapply(seq(xNItv), function(i) mean(xBreaks[c(i, i + 1)]))[xbinCode]
  ybinCode = dt[, .bincode(get(y), breaks = yBreaks, include.lowest = T)]
  ybinMid = sapply(seq(yNItv), function(i) mean(yBreaks[c(i, i + 1)]))[ybinCode]
  # Creating the plotting table: one row per (x, y) bin with mid-points and counts
  tab_match = CJ(xbinCode = seq(xNItv), ybinCode = seq(yNItv))
  tab_plot = data.table(xbinCode, xbinMid, ybinCode, ybinMid)[
    tab_match, .(xbinMid = xbinMid[1], ybinMid = ybinMid[1], N = .N),
    keyby = .EACHI, on = c("xbinCode", "ybinCode")
  ]
  # Returning the table
  return(tab_plot)
}
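# --- Illustration only (not part of the original measurement): a quick sanity check
# --- of the binning function on a tiny simulated table. The column names and bin
# --- counts here are arbitrary, and the demo objects are removed again so that the
# --- memory figures below are unaffected.
demo_dt <- data.table(x1 = runif(1e4), x2 = rnorm(1e4))
demo_binned <- tab_by_bin_idxy(demo_dt, x = "x1", y = "x2", xNItv = 10, yNItv = 10)
print(head(demo_binned))  # one row per (xbinCode, ybinCode) cell: bin mid-points and count N
rm(demo_dt, demo_binned)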
before.mem.size <- memory.size()
# Simulation of dataset
nrow <- 6e5
ncol <- 60
dt <- do.call(data.table, lapply(seq(ncol), function(i) {return(runif(nrow))}) %>% set_names(paste0("x", seq(ncol))))
# Graph plotting
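# Plots are built inside a throwaway environment (dummyEnv) so that, apart from
# size.of.plots assigned into the global environment, nothing created here should
# survive the rm(dummyEnv) below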
dummyEnv <- new.env()
with(dummyEnv, {
  fcn <- function(tab) {
    binned.dt <- tab_by_bin_idxy(dt = tab, x = "x1", y = "x2", xNItv = 50, yNItv = 50)
    plot <- ggplot(binned.dt, aes(x = xbinMid, y = ybinMid)) + geom_point(aes(size = N))
    return(plot)
  }
  lst_plots <- list(
    plot1 = fcn(dt),
    plot2 = fcn(dt[x1 <= 0.7]),
    plot3 = fcn(dt[x5 <= 0.3])
  )
  assign("size.of.plots", object.size(lst_plots), envir = .GlobalEnv)
})
rm(dummyEnv)
# After use, remove the dataset and trigger garbage collection
rm(dt)
gc();gc()
after.mem.size <- memory.size()
# Memory reports
print(paste0("before.mem.size = ", before.mem.size))
print(paste0("after.mem.size = ", after.mem.size))
print(paste0("plot.objs.size = ", size.of.plots / 1000000))
I have tried the following modifications to the code:
- Inside fcn, removing the ggplot call and returning NULL instead of a plot object: the memory leakage disappears completely. But this is not a solution, because I need the plots.
- The fewer plots requested, and the fewer columns/rows passed to fcn, the smaller the memory leakage.
- The memory leakage also occurs if I make no subset at all and create only one plot object (the example above creates 3).
- Even after the whole process, calling rm(list = ls()) does not recover the memory (a diagnostic sketch follows this list).
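For reference, below is a minimal diagnostic sketch I could drop into the with(dummyEnv, ...) block just before the assign() call, while lst_plots still exists. It assumes the lobstr package is installed; unlike object.size(), lobstr::obj_size() also counts environments reachable from an object, so a large gap between the two numbers would suggest the plots keep dt (or the intermediate tables) alive through their captured environments.

# Diagnostic sketch (assumes the lobstr package is installed); to be placed inside
# with(dummyEnv, { ... }) right before the assign() call, while lst_plots exists
library(lobstr)
print(object.size(lst_plots))        # the small figure reported by the example above
print(lobstr::obj_size(lst_plots))   # size including environments the plots reference
# Each ggplot object stores the environment it was created in:
print(ls(lst_plots$plot1$plot_env))  # does it still bind tab / binned.dt / plot?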
I wish to know why this happens and how to get rid of it, without compromising my need to make binned plots and to subset dt for the different plots.
Thanks for your attention!
Source: https://stackoverflow.com/questions/53312860/memory-leakage-in-using-ggplot-on-large-binned-datasets