Fastest way to read in 100,000 .dat.gz files


I'm sort of surprised that this actually worked. Hopefully it works for your case too. I'm quite curious to know how the speed compares to reading the compressed data from disk directly in R (albeit with a penalty for non-vectorization) instead.

library(data.table)

# Grab the header line once to capture the column names
# (newer data.table versions prefer passing shell commands via fread(cmd = ...))
tblNames = fread('cat *dat.gz | gunzip | head -n 1')[, colnames(.SD)]
# Concatenate all files, drop each file's repeated header row, parse in one pass
tbl = fread('cat *dat.gz | gunzip | grep -v "^Day"')
setnames(tbl, tblNames)
tbl
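The trick here is that head -n 1 pulls a single header line to recover the column names, while grep -v "^Day" strips the header row repeated at the top of each file (the first column is evidently named Day), so fread sees one clean stream of records.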

R can read gzipped files natively, using the gzfile connection. See if this works.

library(data.table)

# dat.files is assumed to be the character vector of .dat.gz paths from the question
rbindlist(lapply(dat.files, function(f) {
    # gzfile() opens a decompressing connection, so no external gunzip is needed
    read.delim(gzfile(f))
}))
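For completeness, one plausible way to build dat.files, if it isn't defined already (the pattern and path here are assumptions, not taken from the question):

dat.files <- list.files(pattern = "\\.dat\\.gz$", full.names = TRUE)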

The bottleneck might be caused by using system() calls to launch an external application.

You should try using the built-in functions to extract the archive instead. This answer explains how: Decompress gz file using R
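As a minimal sketch of both routes, assuming a file named example.dat.gz (the file name is illustrative, and the optional R.utils package is a suggestion, not necessarily what the linked answer uses):

# Read straight through a decompressing connection -- no external process
dat <- read.delim(gzfile("example.dat.gz"))

# Or decompress to disk first, in-process, via the R.utils package
# install.packages("R.utils")
R.utils::gunzip("example.dat.gz", remove = FALSE)  # keep the original .gz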
