Question
I have a large dataset (about 13 GB uncompressed) and I need to load it repeatedly. The first load (and save to a different format) can be very slow, but every load after that should be as fast as possible. What is the fastest way, and the fastest format, from which to load a data set?
My suspicion is that the optimal choice is something like
saveRDS(obj, file = 'bigdata.Rda', compress = FALSE)
obj <- readRDS('bigdata.Rda')
But this seems to be slower than using the fread function from the data.table package, which should not be the case, since fread has to parse a text CSV file (although it is admittedly highly optimized).
Some timings for a ~800 MB dataset are:
> system.time(tmp <- fread("data.csv"))
Read 6135344 rows and 22 (of 22) columns from 0.795 GB file in 00:00:43
user system elapsed
36.94 0.44 42.71
> system.time(saveRDS(tmp, file = 'tmp.Rda'))
> system.time(tmp <- readRDS('tmp.Rda'))
user system elapsed
69.96 2.02 84.04
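For a like-for-like comparison on your own machine, it may help to time both the compressed (default) and uncompressed RDS variants against fread on the same file. A minimal sketch, assuming a data.csv laid out as above and that data.table is installed:

library(data.table)

# Read the CSV once, then write it back out as compressed and
# uncompressed RDS so the reload times can be compared directly.
tmp <- fread("data.csv")
system.time(saveRDS(tmp, file = "tmp_compressed.rds"))                    # default gzip compression
system.time(saveRDS(tmp, file = "tmp_uncompressed.rds", compress = FALSE))

# Time each way of getting the data back into memory.
system.time(a <- fread("data.csv"))
system.time(b <- readRDS("tmp_compressed.rds"))
system.time(c <- readRDS("tmp_uncompressed.rds"))

The relative ordering depends on disk speed and on how well the data compresses, which is why timings like the ones above can differ across machines.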
Previous Questions
This question is related but does not reflect the current state of R; for example, one answer suggests that reading from a binary format will always be faster than from a text format. The suggestion to use a SQL database is also not helpful in my case, as the entire data set is required, not just a subset of it.
There are also related questions about the fastest way of loading data once (e.g. 1).
Answer 1:
It depends on what you plan to do with the data. If you want the entire dataset in memory for some operation, then your best bet is probably fread or readRDS (the file for data saved as RDS is much, much smaller, if that matters to you).
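If on-disk size matters, it is easy to check directly. A small sketch, assuming the CSV and the (compressed) RDS copy from the timings above are still on disk:

# Compare on-disk sizes of the original CSV and the RDS copy.
file.size("data.csv")
file.size("tmp.Rda")
file.size("tmp.Rda") / file.size("data.csv")   # fraction of the CSV size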
If you will be doing summary operations on the data, I have found that a one-time conversion to a database (using sqldf) is a much better option: subsequent operations are much faster because they run as SQL queries against the database. That is also because I don't have enough RAM to load a 13 GB file into memory.
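As a rough sketch of that one-time conversion idea, here is one way to do it with DBI/RSQLite (the backend sqldf uses by default). The file names, table name, and column name are made up for the example:

library(data.table)
library(DBI)    # RSQLite must also be installed

# One-time (slow) conversion: read the CSV and write it to an on-disk SQLite file.
dt  <- fread("data.csv")
con <- dbConnect(RSQLite::SQLite(), "bigdata.sqlite")
dbWriteTable(con, "bigdata", dt, overwrite = TRUE)

# Later sessions: run summary queries without loading the full dataset into RAM.
res <- dbGetQuery(con, "SELECT some_group, COUNT(*) AS n FROM bigdata GROUP BY some_group")
dbDisconnect(con)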
Source: https://stackoverflow.com/questions/31887230/what-is-the-fastest-way-and-fastest-format-for-loading-large-data-sets-into-r