How to read specific rows of a CSV file with the fread function

一向 2020-12-18 06:26

I have a big CSV file of doubles (10 million by 500) and I only want to read in a few thousand rows of this file (at various locations between 1 and 10 million), defined by

1 answer
  • 2020-12-18 07:27

    This approach takes a logical vector v (corresponding to your read_vec), identifies the contiguous runs of rows to read, feeds each run to its own call to fread(...), and rbinds the results together.

    If the rows you want are scattered randomly throughout the file, almost every row becomes its own fread(...) call, so this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)), there will be only a few calls to fread(...) and you may see a significant improvement.
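
    To get a feel for which case you are in before trying this, you could count how many contiguous blocks your row selection contains, since that is the number of fread(...) calls needed. This is only a rough sketch; count_blocks is a hypothetical helper name, not part of data.table.

    ```r
    # Hypothetical helper: number of contiguous blocks in a set of row numbers,
    # i.e. how many separate fread() calls the approach would make.
    count_blocks <- function(rows) {
      rows <- sort(unique(rows))
      if (length(rows) == 0) return(0L)
      # a new block starts wherever the gap to the previous row is larger than 1
      1L + sum(diff(rows) > 1)
    }

    blocky <- c(1:50, 55, 70, 100:500, 700:1500)
    count_blocks(blocky)      # 5 blocks: 1:50, 55, 70, 100:500, 700:1500

    set.seed(42)
    scattered <- sample(1e4, 1000)   # 1000 rows drawn at random
    count_blocks(scattered)   # typically close to the number of rows
    ```

    If the block count is close to the number of rows, the chunked reads are unlikely to pay off.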

    # create sample dataset
    set.seed(1)
    m   <- matrix(rnorm(1e5),ncol=10)
    csv <- data.frame(x=1:1e4,m)
    write.csv(csv,"test.csv")
    # s: rows we want to read
    s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
    # v: logical, TRUE means read this row (equivalent to your read_vec)
    v <- (1:1e4 %in% s)
    
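    # runs: run-length encoding of v; each run of TRUE is one contiguous block of rows
    # (named runs rather than seq to avoid masking base::seq)
    runs <- rle(v)
    idx  <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
    # indx: start = starting row of each block, length = number of rows in it (compare to s)
    indx <- data.frame(start=idx, length=runs$lengths[which(runs$values)])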
    
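    library(data.table)
    # one fread() call per block; skip = start works because line 1 of test.csv is the
    # header, so skipping `start` lines lands on data row `start`
    result <- rbindlist(Map(function(start, len) fread("test.csv", skip=start, nrows=len, header=FALSE),
                            indx$start, indx$length))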
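
    The same idea can be wrapped in a small function that takes the row numbers directly instead of a logical vector. This is a minimal sketch, not a tested drop-in: read_rows is a hypothetical name, and it assumes the file has exactly one header line and that row numbers count data rows starting at 1.

    ```r
    library(data.table)

    # Hypothetical helper: read the given data rows (1-based, header excluded)
    # from a CSV file that has exactly one header line.
    read_rows <- function(file, rows) {
      rows <- sort(unique(rows))
      stopifnot(length(rows) > 0)
      breaks <- c(0, which(diff(rows) > 1), length(rows))  # block boundaries within rows
      starts <- rows[head(breaks, -1) + 1]                 # first row of each block
      lens   <- diff(breaks)                               # number of rows in each block
      # skipping `start` lines skips the header plus the (start - 1) rows before the block
      chunks <- Map(function(st, n) fread(file, skip = st, nrows = n, header = FALSE),
                    starts, lens)
      rbindlist(chunks)
    }

    # small demonstration on a temporary file
    tmp <- tempfile(fileext = ".csv")
    df  <- data.frame(x = 1:1000, y = (1:1000) * 2)
    write.csv(df, tmp, row.names = FALSE)

    rows <- c(1:5, 10, 500:502)
    res  <- read_rows(tmp, rows)
    setnames(res, names(df))   # the header is skipped, so restore the column names
    all(res$x == rows)         # checks that the requested rows came back
    ```

    Because every chunk is read with header=FALSE, the column names have to be restored afterwards, for example from a separate fread(file, nrows = 0) call or from names you already know.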
    