Question
I have a few hundred thousand very small .dat.gz files that I want to read into R as efficiently as possible. I read in each file and immediately aggregate and discard the data, so I am not worried about managing memory as I get near the end of the process. I just really want to speed up the bottleneck, which happens to be unzipping and reading in the data.
Each dataset consists of 366 rows and 17 columns. Here is a reproducible example of what I am doing so far:
Building reproducible data:
require(data.table)
# Make dir
system("mkdir practice")
# Function to create data
create_write_data <- function(file.nm) {
  dt <- data.table(Day = 0:365)
  dt[, (paste0("V", 1:17)) := lapply(1:17, function(x) rnorm(n = 366))]
  write.table(dt, paste0("./practice/", file.nm), row.names = FALSE, sep = "\t", quote = FALSE)
  system(paste0("gzip ./practice/", file.nm))
}
And here is the code that applies it:
# Apply function to create 10 fake zipped data.frames (550 kb on disk)
tmp <- lapply(paste0("dt", 1:10,".dat"), function(x) create_write_data(x))
And here is my most efficient code so far to read in the data:
# Function to read in files as fast as possible
read_Fast <- function(path.gz) {
  system(paste0("gzip -d ", path.gz))  # Unzip file
  path.dat <- gsub(".gz", "", path.gz)
  dat_run <- fread(path.dat)
}
# Apply above function
dat.files <- list.files(path="./practice", full.names = TRUE)
system.time(dat.list <- rbindlist(lapply(dat.files, read_Fast), fill=TRUE))
dat.list
I have bundled this up in a function and applied it in parallel, but it is still much, much too slow for what I need this for.
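For reference, the parallel step looks roughly like the following (a sketch only; I use parallel::mclapply here, and the core count is arbitrary):
# Rough sketch of the parallel wrapper: fork over the file list and bind the results
# (mclapply() is from the base "parallel" package and only forks on Unix-alikes)
library(parallel)
dat.files <- list.files(path = "./practice", full.names = TRUE)
dat.list  <- rbindlist(mclapply(dat.files, read_Fast, mc.cores = 4), fill = TRUE)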
I have already tried h2o.importFolder from the wonderful h2o package, but it is actually much slower than plain R with data.table. Maybe there is a way to speed up the unzipping of the files, but I am unsure. From the few times that I have run this, I have noticed that unzipping usually takes about two-thirds of the function's run time.
Answer 1:
I'm sort of surprised that this actually worked. Hopefully it works for your case. I'm quite curious to know how speed compares to reading in compressed data from disk directly from R (albeit with a penalty for non-vectorization) instead.
# Read just the first (header) line to get the column names
tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
# Concatenate and decompress everything in one shell pass, dropping every header row
tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
setnames(tbl, tblNames)
tbl
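Note that newer data.table versions prefer fread(cmd = ...) for shell pipelines like the ones above. If you do want the per-file comparison from R that I mention, a rough sketch (assuming data.table 1.11.6 or later for the cmd argument, and the dat.files vector from the question) would be:
# Pipe each file through gunzip straight into fread; no intermediate .dat is written to disk
tbl2 <- rbindlist(lapply(dat.files, function(f)
  fread(cmd = paste("gunzip -c", shQuote(f)))))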
Answer 2:
R has the ability to read gzipped files natively, using the gzfile function. See if this works.
rbindlist(lapply(dat.files, function(f) {
  read.delim(gzfile(f))
}))
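A related option, if fread is preferred over read.delim for speed: recent data.table versions (roughly 1.11.0 onward, with the R.utils package installed) can read .gz files directly, so a sketch of the same idea is simply:
# fread() decompresses .dat.gz transparently here (requires the R.utils package)
rbindlist(lapply(dat.files, fread), fill = TRUE)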
Answer 3:
The bottleneck might be caused by the use of the system() call to an external application.
You should try using the built-in functions to extract the archive. This answer explains how: Decompress gz file using R
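For example, one such option is R.utils::gunzip. A minimal sketch of read_Fast rewritten that way (remove = FALSE keeps the original archive, unlike gzip -d):
# Same idea as read_Fast, but decompressing inside R instead of shelling out to gzip
read_fast2 <- function(path.gz) {
  path.dat <- sub("\\.gz$", "", path.gz)
  R.utils::gunzip(path.gz, destname = path.dat, remove = FALSE, overwrite = TRUE)
  fread(path.dat)
}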
Source: https://stackoverflow.com/questions/35763574/fastest-way-to-read-in-100-000-dat-gz-files