Question:
Disclaimer:
Many of you pointed to a duplicate post. I was aware of it, but I believe it is not a fair duplicate, since the ways of saving/loading may differ between data frames and lists. For instance, the packages fst and feather work on data frames but not on lists. My question is specific to lists.
I have a ~50M element list and I'd like to save it to a file in order to share it among different R sessions.
I know the native ways of saving in R (save, save.image, saveRDS). My question is: would you still use these functions on data at this scale?
What is the fastest way to save it and read it back? (Any R-readable format would be fine.)
Answer 1:
After some research, it appears that there is no real alternative to the base saveRDS function, and few packages deal with large lists.
Saving a list as a column of a data.table/data.frame does not work with the packages fst and feather, but it does work with the package data.table. However, when the column is read back it becomes a character vector, which forces the use of strsplit or its faster alternative str_split.
The only package directly focused on lists that I could find was rlist; however, it does not speed up reading or writing a list from/to a file compared to the base functions saveRDS and readRDS.
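One knob worth trying before switching formats: saveRDS compresses with gzip by default, and setting compress = FALSE often trades a larger file for faster writes and reads. The sketch below is illustrative (the file name "l_fast.rds" and the small list size are assumptions, not from the benchmark above); whether the speedup matters depends on your disk and data, so benchmark it yourself.

```r
# Sketch: uncompressed RDS round trip for a list.
# compress = FALSE skips gzip; files get bigger but I/O is usually faster.
l <- lapply(1:1000, function(x) rnorm(sample(1:5, size = 1)))

saveRDS(l, "l_fast.rds", compress = FALSE)  # write without compression
l_load <- readRDS("l_fast.rds")             # read it back

stopifnot(identical(l, l_load))             # round trip is lossless
```

Since RDS is a binary serialization of the R object itself, the round trip preserves the list exactly, unlike the CSV/strsplit route, which loses numeric precision and type information.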
Benchmarks:
library(data.table)
library(stringr)
library(rlist)

l <- lapply(1:10000000, function(x) rnorm(sample(1:5, size = 1, replace = TRUE)))
dt_l <- data.table(l = as.list(l))

microbenchmark::microbenchmark(times = 5L,
  "data.table" = {
    fwrite(dt_l, "dt_l.csv")
    dt_l <- fread("dt_l.csv", sep = ",", sep2 = "\\|")
    l_load <- str_split(dt_l$l, "\\|")
  },
  "rlist" = {
    list.save(l, "l.rds")
    l_load <- list.load("l.rds")
  },
  "RDS_base" = {
    saveRDS(l, "l.rds")
    l_load <- readRDS("l.rds")
  }
)
Unit: seconds
expr min lq mean median uq max neval
data.table 18.30548 18.67964 18.98801 19.17744 19.19791 19.57956 5
RDS_list.save 16.80936 16.81615 16.86114 16.84012 16.91770 16.92236 5
RDS_base 16.90403 17.23784 18.62475 19.48391 19.60365 19.89431 5
Source: https://stackoverflow.com/questions/51619320/how-can-i-efficiently-save-and-load-a-big-list