I have an input file with about 20 million lines; the file is about 1.2 GB. Is there any way I can plot the data in R? Some of the columns have categories, most are numeric.
Does expanding the available memory with memory.limit(size=2000) (or something bigger) help?
Increasing the memory with memory.limit() helped me ... this was for plotting nearly 36K records with ggplot.
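For reference, memory.limit() is specific to Windows builds of R (and no longer has an effect in recent versions); a minimal sketch of checking and raising it:

memory.limit()             # report the current limit in MB (Windows only)
memory.limit(size = 4000)  # raise the limit to roughly 4 GB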
The hexbin package, which plots hexbins instead of scatterplots for pairs of variables (as suggested by Ben Bolker in Speed up plot() function for large dataset), worked well for me with 2 million records and 4 GB of RAM. But it failed for 200 million records/rows of the same set of variables; I tried reducing the bin size to trade computation time against RAM usage, but that did not help.
For 20 million records, you can try hexbins with xbins = 20, 30, 40 to start with, as in the sketch below.
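A minimal sketch of that approach, assuming your data frame dat has two numeric columns x and y (the names are placeholders):

library(hexbin)
# bin the 20M (x, y) pairs into hexagons instead of drawing every point
hb = hexbin(dat$x, dat$y, xbins = 30)
plot(hb, xlab = 'x', ylab = 'y')  # draws hexagon counts, not the raw points

Larger xbins gives finer detail but uses more memory and time, so start small and increase it until the plot is detailed enough.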
Without a clearer description of the kind of plot you want, it is hard to give concrete suggestions. In general, though, there is no need to plot 20 million points. For example, a time series could be represented by a spline fit or by some kind of average, e.g. aggregating hourly data into daily averages. Alternatively, you could draw a subset of the data, e.g. only one point per day in the time-series example. So I think your challenge is not so much getting 20M points (or even 800k) onto a plot, but aggregating your data in a way that conveys the message you want to tell.
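As a rough sketch of the daily-average idea (the column names time and value are assumptions about your data):

# assume dat has a POSIXct column 'time' and a numeric column 'value'
dat$day = as.Date(dat$time)
daily = aggregate(value ~ day, data = dat, FUN = mean)  # ~20M rows collapse to one point per day
plot(daily$day, daily$value, type = 'l')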
Plotting directly into a raster file device (by calling png(), for instance) is a lot faster. I tried plotting rnorm(100000): on my laptop, an X11 cairo plot took 2.723 seconds, while the png device finished in 2.001 seconds. With 1 million points, the numbers are 27.095 and 19.954 seconds. I use Fedora Linux; here is the code.
# f: plot n random points to a PNG raster device
f = function(n) {
  x = rnorm(n)
  y = rnorm(n)
  png('test.png')
  plot(x, y)
  dev.off()
}

# g: plot n random points on the default (screen) device
g = function(n) {
  x = rnorm(n)
  y = rnorm(n)
  plot(x, y)
}

# compare elapsed times for 100,000 points
system.time(f(100000))
system.time(g(100000))