How to read specific rows of a CSV file with the fread function

后端未结


I have a big CSV file of doubles (10 million by 500) and I only want to read in a few thousand rows of this file (at various locations between 1 and 10 million), defined by


1 answer

闹比i

2020-12-18 07:27

This approach takes a logical vector v (corresponding to your read_vec), identifies the runs of consecutive rows to read, passes each run to a separate call to fread(...), and rbinds the results together.

If the rows you want are scattered randomly throughout the file, this may not be faster, because nearly every row then needs its own call to fread(...). However, if the rows come in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)), then there will be only a few calls to fread(...) and you may see a significant improvement.
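To make the block argument concrete, here is a small sketch that counts how many fread(...) calls the example vector above would need. The file length of 1500 rows and the name `blocks` are assumptions made only for this illustration.

```r
# Rows to read, taken from the example above
s <- c(1:50, 55, 70, 100:500, 700:1500)

# Logical mask over the rows of a file assumed to have 1500 data rows
v <- 1:1500 %in% s

# Run-length encode the mask; each run of TRUE values is one block
runs <- rle(v)
ends <- cumsum(runs$lengths)
blocks <- data.frame(start  = (ends - runs$lengths + 1)[runs$values],
                     length = runs$lengths[runs$values])

nrow(blocks)   # number of fread() calls needed: 5
length(s)      # number of rows requested: 1254
```

Here 1254 rows collapse into 5 blocks, so only 5 calls are needed. A vector of 1254 isolated rows would need 1254 calls.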

# create sample dataset
set.seed(1)
m   <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
# row.names=FALSE avoids writing an extra, unnamed row-name column
write.csv(csv,"test.csv",row.names=FALSE)
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, TRUE means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)

# run-length encode v: each run of TRUE values is one block of rows to read
runs <- rle(v)
idx  <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
# indx: start = starting row of each block, length = number of rows in it (compare to s)
indx <- data.frame(start=idx, length=runs$lengths[which(runs$values)])

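library(data.table)
# skip=start works because the header is line 1: skipping `start` lines lands on data row `start`
chunks <- Map(function(start, len) fread("test.csv", skip=start, nrows=len, header=FALSE), indx$start, indx$length)
result <- rbindlist(chunks)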
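The same idea can be wrapped in a function and checked against a full read of a small file. This is only a sketch: `read_rows` is a hypothetical helper, not part of data.table, and it assumes the CSV has exactly one header line and that `rows` holds 1-based data-row numbers. It finds blocks with diff() on the sorted row numbers rather than rle() on a logical mask, which avoids building a mask as long as the file.

```r
library(data.table)

# Hypothetical helper (not part of data.table): read only the data rows
# listed in `rows` from a CSV file that has a single header line.
read_rows <- function(file, rows) {
  rows <- sort(unique(rows))
  # A new block starts wherever consecutive row numbers are not adjacent
  is_start <- c(TRUE, diff(rows) != 1)
  starts   <- rows[is_start]
  lens     <- diff(c(which(is_start), length(rows) + 1))
  # Skipping `start` lines skips the header plus the first start - 1 data rows
  chunks <- Map(function(start, len)
                  fread(file, skip = start, nrows = len, header = FALSE),
                starts, lens)
  out <- rbindlist(chunks)
  # header = FALSE drops the column names, so recover them from the header line
  setnames(out, names(fread(file, nrows = 0)))
  out
}

# Small temporary file for the check
tmp <- tempfile(fileext = ".csv")
set.seed(1)
fwrite(data.frame(x = 1:1000, matrix(rnorm(5000), ncol = 5)), tmp)

s    <- c(1:50, 53, 65, 100:200, 990:1000)
res  <- read_rows(tmp, s)
full <- fread(tmp)

# Compare the selected rows against the same rows of a full read
isTRUE(all.equal(res, full[s]))
```

On a file the size described in the question, the full read used for the comparison would defeat the purpose, so the check is only meant for a small sample like this one.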
