Question
I am making various ggplots on a very large dataset (much larger than the example below). I wrote a binning function over both the x- and y-axes to make plotting such a large dataset feasible.

In the following example, memory.size() is recorded at the start. The large dataset is then simulated as dt, and dt's x2 is plotted against x1 with binning. The plotting is repeated with different subsets of dt. The size of the resulting plot objects is checked with object.size() and stored. After the plot objects have been created, rm(dt) is executed, followed by a double gc(). At this point memory.size() is recorded again, and finally the memory.size() at the end is compared with the one at the beginning and printed.

Given the small size of the plot objects, I expected the memory.size() at the end to be similar to the one at the beginning. But it is not: memory.size() does not go back down until I restart the R session.
REPRODUCIBLE EXAMPLE
library(data.table)
library(ggplot2)
library(magrittr)
# The binning function
# dt    = a data.table
# x     = column name for the x-axis (character)
# y     = column name for the y-axis (character)
# xNItv = number of bins on the x-axis
# yNItv = number of bins on the y-axis
# Value:  a binned data.table with one row per (x, y) bin
tab_by_bin_idxy <- function(dt, x, y, xNItv, yNItv) {
  # Binning
  xBreaks = dt[, seq(min(get(x), na.rm = T), max(get(x), na.rm = T), length.out = xNItv + 1)]
  yBreaks = dt[, seq(min(get(y), na.rm = T), max(get(y), na.rm = T), length.out = yNItv + 1)]
  xbinCode = dt[, .bincode(get(x), breaks = xBreaks, include.lowest = T)]
  xbinMid = sapply(seq(xNItv), function(i) mean(xBreaks[c(i, i + 1)]))[xbinCode]
  ybinCode = dt[, .bincode(get(y), breaks = yBreaks, include.lowest = T)]
  ybinMid = sapply(seq(yNItv), function(i) mean(yBreaks[c(i, i + 1)]))[ybinCode]
  # Creating the plotting table: one row per (x, y) bin with mid-points and counts
  tab_match = CJ(xbinCode = seq(xNItv), ybinCode = seq(yNItv))
  tab_plot = data.table(xbinCode, xbinMid, ybinCode, ybinMid)[
    tab_match, .(xbinMid = xbinMid[1], ybinMid = ybinMid[1], N = .N),
    keyby = .EACHI, on = c("xbinCode", "ybinCode")
  ]
  # Returning the table
  return(tab_plot)
}
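# --- Illustration only (not part of the original measurement): a quick sanity check
# --- of the binning function on a tiny simulated table. The column names and bin
# --- counts here are arbitrary, and the demo objects are removed again so that the
# --- memory figures below are unaffected.
demo_dt <- data.table(x1 = runif(1e4), x2 = rnorm(1e4))
demo_binned <- tab_by_bin_idxy(demo_dt, x = "x1", y = "x2", xNItv = 10, yNItv = 10)
print(head(demo_binned))  # one row per (xbinCode, ybinCode) cell: bin mid-points and count N
rm(demo_dt, demo_binned)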
before.mem.size <- memory.size()
# Simulation of dataset
nrow <- 6e5
ncol <- 60
dt <- do.call(data.table, lapply(seq(ncol), function(i) {return(runif(nrow))}) %>% set_names(paste0("x", seq(ncol))))
# Graph plotting
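# Plots are built inside a throwaway environment (dummyEnv) so that, apart from
# size.of.plots assigned into the global environment, nothing created here should
# survive the rm(dummyEnv) below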
dummyEnv <- new.env()
with(dummyEnv, {
  fcn <- function(tab) {
    binned.dt <- tab_by_bin_idxy(dt = tab, x = "x1", y = "x2", xNItv = 50, yNItv = 50)
    plot <- ggplot(binned.dt, aes(x = xbinMid, y = ybinMid)) + geom_point(aes(size = N))
    return(plot)
  }
  lst_plots <- list(
    plot1 = fcn(dt),
    plot2 = fcn(dt[x1 <= 0.7]),
    plot3 = fcn(dt[x5 <= 0.3])
  )
  assign("size.of.plots", object.size(lst_plots), envir = .GlobalEnv)
})
rm(dummyEnv)
# After use, remove the dataset and trigger garbage collection
rm(dt)
gc();gc()
after.mem.size <- memory.size()
# Memory reports
print(paste0("before.mem.size = ", before.mem.size))
print(paste0("after.mem.size = ", after.mem.size))
print(paste0("plot.objs.size = ", size.of.plots / 1000000))
I have tried the following modifications to the code:
- Inside fcn, removing the ggplot call and returning NULL instead of a plot object: the memory leakage disappears completely. But this is not a solution, because I need the plots.
- The fewer plots requested, and the fewer columns/rows passed to fcn, the smaller the memory leakage.
- The memory leakage also occurs if I make no subset at all and create only one plot object (the example above creates 3).
- Even after the whole process, calling rm(list = ls()) does not recover the memory (a diagnostic sketch follows this list).
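For reference, below is a minimal diagnostic sketch I could drop into the with(dummyEnv, ...) block just before the assign() call, while lst_plots still exists. It assumes the lobstr package is installed; unlike object.size(), lobstr::obj_size() also counts environments reachable from an object, so a large gap between the two numbers would suggest the plots keep dt (or the intermediate tables) alive through their captured environments.

# Diagnostic sketch (assumes the lobstr package is installed); to be placed inside
# with(dummyEnv, { ... }) right before the assign() call, while lst_plots exists
library(lobstr)
print(object.size(lst_plots))        # the small figure reported by the example above
print(lobstr::obj_size(lst_plots))   # size including environments the plots reference
# Each ggplot object stores the environment it was created in:
print(ls(lst_plots$plot1$plot_env))  # does it still bind tab / binned.dt / plot?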
I wish to know why this happens and how to get rid of it, without compromising my need to make binned plots and to subset dt for the different plots.
Thanks for your attention!
Source: https://stackoverflow.com/questions/53312860/memory-leakage-in-using-ggplot-on-large-binned-datasets