Question
I am trying to download the dataset at the link below. It is about 14,000,000 rows long. I ran this code chunk and I am stuck at unzip(). The code has been running for a really long time and my computer is getting hot.
I tried a few different ways that don't use unzip, but then I get stuck at the read.csv/vroom/read_csv step. Any ideas? This is a public dataset, so anyone can try.
library(vroom)
temp <- tempfile()
download.file("https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip", temp)
unzip(temp, "hmda_2017_nationwide_all-records_labels.csv")
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.csv")
unlink(temp)
Answer 1:
Since the data set is quite large, here are two possible solutions:
With data.table (very fast, but only feasible if the data fits into memory; the error shown in the comments is what fread reports when it cannot map the file)
library(data.table)
system('curl https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip > hmda_2017_nationwide_all-records_labels.zip && unzip hmda_2017_nationwide_all-records_labels.zip')
dat <- fread("hmda_2017_nationwide_all-records_labels.csv")
# System errno 22 unmapping file: Invalid argument
# Error in fread("hmda_2017_nationwide_all-records_labels.csv") :
# Opened 10.47GB (11237068086 bytes) file ok but could not memory map it.
# This is a 64bit process. There is probably not enough contiguous virtual memory available.
With readLines (read data step-wise)
f <- file("./hmda_2017_nationwide_all-records_labels.csv", "r")
# if header:
header <- unlist(strsplit(unlist(strsplit(readLines(f, n=1), "\",\"")), ","))
dd <- as.data.frame(t(data.frame(strsplit(readLines(f, n=100), "\",\"") )))
colnames(dd) <- header
rownames(dd) <- 1:nrow(dd)
Repeat and add to the data frame if needed:
de <- t(as.data.frame( strsplit(readLines(f, n=10), "\",\"") ) )
colnames(de) <- header
dd <- rbind( dd, de )
rownames(dd) <- 1:nrow(dd)
close(f)
Use seek() to jump within the data.
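The step-wise idea above can also be sketched with read.csv() on an open connection, which lets R handle the quoting instead of splitting on "," by hand. This is only a sketch under assumptions: read_csv_chunks and handle_chunk are hypothetical names that do not appear in the original answer, every column is read as character so chunks stay consistent, and the demo runs on a small temporary file rather than the HMDA data.

```r
# Hypothetical helper: read a CSV in fixed-size chunks with base R only.
# handle_chunk is a caller-supplied function applied to each chunk.
read_csv_chunks <- function(path, chunk_size = 100000L, handle_chunk) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first line is assumed to hold the column names.
  header <- scan(con, what = character(), sep = ",", quote = "\"",
                 nlines = 1L, quiet = TRUE)

  repeat {
    # read.csv() continues from the current position of an open connection.
    # It raises an error once no lines are left, which is treated as
    # end-of-file here (other read errors are swallowed too in this sketch).
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_size,
               col.names = header, colClasses = "character",
               check.names = FALSE),
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break
    handle_chunk(chunk)
    if (nrow(chunk) < chunk_size) break  # short chunk means the file is done
  }
  invisible(NULL)
}

# Small demo on a temporary file: count rows without holding them all.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, label = letters[1:10]), tmp, row.names = FALSE)

total <- 0L
read_csv_chunks(tmp, chunk_size = 4L, function(chunk) {
  total <<- total + nrow(chunk)
})
```

For the real file you would pass a larger chunk_size and a handle_chunk that filters or aggregates each chunk, so that only the reduced result stays in memory.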
Answer 2:
I first downloaded the file to my computer, then used vroom (https://vroom.r-lib.org/) to load it without unzipping it:
library(vroom)
df2017 <- vroom("hmda_2017_nationwide_all-records_labels.zip")
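I get a warning about possible truncation, and the object has the dimensions below. That is fewer rows than the roughly 14 million mentioned in the question, so the read may be incomplete: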
> dim(df2017)
[1] 5448288 78
One nice thing about vroom is that it doesn't load all the data straight into memory; it indexes the file and reads values lazily as they are accessed.
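The truncation warning may be related to a limit that, as far as I recall, the ?unzip help page describes: R's internal unzip method may truncate archive members that were 4GB or larger before compression, and the CSV here is reported as 10.47GB in Answer 1. A minimal sketch of a workaround is to download with a longer timeout and delegate extraction to an external unzip program. The helper name download_and_extract and its parameters are hypothetical, and it assumes an unzip executable is available on the PATH.

```r
# Hypothetical helper (not from the original answers): download a large zip
# and extract it with an external unzip program instead of R's internal one.
download_and_extract <- function(url, zip_path, exdir = ".",
                                 unzip_cmd = "unzip", timeout_sec = 3600) {
  # download.file() honours options(timeout), which defaults to 60 seconds
  # and is often too short for large files; restore the old value on exit.
  old <- options(timeout = timeout_sec)
  on.exit(options(old))

  # mode = "wb" writes the archive as binary, which matters on Windows.
  download.file(url, zip_path, mode = "wb")

  # unzip = <program> makes R call the external tool rather than the
  # internal method; unzip_cmd is assumed to be on the PATH.
  utils::unzip(zip_path, exdir = exdir, unzip = unzip_cmd)
  invisible(zip_path)
}

# Usage (not run here because the download is large):
# download_and_extract(
#   "https://files.consumerfinance.gov/hmda-historic-loan-data/hmda_2017_nationwide_all-records_labels.zip",
#   "hmda_2017_nationwide_all-records_labels.zip"
# )
```

After extraction, the CSV can be read with vroom or in chunks; checking the row count against the expected size is a simple way to confirm nothing was cut off.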
Source: https://stackoverflow.com/questions/65401851/fast-way-to-download-a-really-big-14-million-row-csv-from-a-zip-file-unzip-an