I have an input file with about 20 million lines; the file is about 1.2 GB. Is there any way I can plot the data in R? Some of the columns have categories, most are numeric.
Does expanding the available memory with memory.limit(size=2000) (or something bigger) help?
Increasing the memory with memory.limit() helped me ... this was for plotting nearly 36K records with ggplot.
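For reference, memory.limit() is specific to Windows builds of R (and no longer has an effect in recent versions); a minimal sketch of checking and raising it:

memory.limit()             # report the current limit in MB (Windows only)
memory.limit(size = 4000)  # raise the limit to roughly 4 GB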
The hexbin package, which plots hexbins instead of scatterplots for pairs of variables (as suggested by Ben Bolker in Speed up plot() function for large dataset), worked well for me with 2 million records and 4 GB of RAM. But it failed for 200 million records/rows of the same set of variables; I tried reducing the bin size to trade computation time against RAM usage, but that did not help.
For 20 million records, you can try hexbins with xbins = 20, 30, 40 to start with, as in the sketch below.
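A minimal sketch of that approach, assuming your data frame dat has two numeric columns x and y (the names are placeholders):

library(hexbin)
# bin the 20M (x, y) pairs into hexagons instead of drawing every point
hb = hexbin(dat$x, dat$y, xbins = 30)
plot(hb, xlab = 'x', ylab = 'y')  # draws hexagon counts, not the raw points

Larger xbins gives finer detail but uses more memory and time, so start small and increase it until the plot is detailed enough.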
Without a clearer description of the kind of plot you want, it is hard to give concrete suggestions. In general, though, there is no need to plot 20 million points. For example, a time series could be represented by a spline fit or by some kind of average, e.g. aggregating hourly data into daily averages. Alternatively, you could draw a subset of the data, e.g. only one point per day in the time-series example. So I think your challenge is not so much getting 20M points (or even 800k) onto a plot, but aggregating your data in a way that conveys the message you want to tell.
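As a rough sketch of the daily-average idea (the column names time and value are assumptions about your data):

# assume dat has a POSIXct column 'time' and a numeric column 'value'
dat$day = as.Date(dat$time)
daily = aggregate(value ~ day, data = dat, FUN = mean)  # ~20M rows collapse to one point per day
plot(daily$day, daily$value, type = 'l')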
Plotting directly into a raster file device (by calling png(), for instance) is a lot faster. I tried plotting rnorm(100000): on my laptop, an X11 cairo plot took 2.723 seconds, while the png device finished in 2.001 seconds. With 1 million points, the numbers are 27.095 and 19.954 seconds. I use Fedora Linux; here is the code.
# f: plot n random points to a PNG raster device
f = function(n) {
  x = rnorm(n)
  y = rnorm(n)
  png('test.png')
  plot(x, y)
  dev.off()
}

# g: plot n random points on the default (screen) device
g = function(n) {
  x = rnorm(n)
  y = rnorm(n)
  plot(x, y)
}

# compare elapsed times for 100,000 points
system.time(f(100000))
system.time(g(100000))