Filtering multiple csv files while importing into data frame

Asked 2021-01-03 11:18

I have a large number of csv files that I want to read into R. All the column headings in the csvs are the same. But I want to import only those rows from each file into the data frame that satisfy a condition on one of the columns (here, rows whose value of v3 falls inside a given range).
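For reference, the straightforward approach that the answers improve on might look like the sketch below: read each file in full, filter, then bind. The column name v3 and the 2..7 bounds are illustrative assumptions taken from the answers, and the two csv files are written here only as demo data.

```r
# Demo setup: two small csv files (column name v3 and the 2..7 bounds
# are illustrative, taken from the answers)
write.csv(data.frame(v1 = "a", v2 = "x", v3 = c(1, 4, 9)), "a1.csv", row.names = FALSE)
write.csv(data.frame(v1 = "b", v2 = "y", v3 = c(3, 6, 8)), "a2.csv", row.names = FALSE)

# Read each file fully, then keep only the rows inside the range
fileNames <- list.files(pattern = "^a[12]\\.csv$")
alldata <- do.call(rbind, lapply(fileNames, function(fn) {
  d <- read.csv(fn, stringsAsFactors = FALSE)
  d[d$v3 > 2 & d$v3 < 7, ]
}))
```

This works but reads every file whole before filtering, which is exactly the memory cost the answers below try to avoid.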

3 Answers
  • 2021-01-03 11:30

    If you are really stuck for memory, then the following solution might work. It uses LaF to read only the column needed for filtering; then calculates the total number of lines that will be read; initializes the complete data.frame; and then reads the required lines from the files. (It is probably not faster than the other solutions.)

    library("LaF")
    
    colnames <- c("v1","v2","v3")
    colclasses <- c("character", "character", "numeric")
    
    fileNames <- list.files(pattern = "\\.csv$")
    
    # First determine which lines to read from each file and the total number of lines
    # to be read
    lines <- list()
    for (fn in fileNames) {
      laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
      d   <- laf$v3[] 
      lines[[fn]] <- which(d > 2 & d < 7)
    }
    nlines <- sum(sapply(lines, length))
    
    # Initialize data.frame
    df <- as.data.frame(lapply(colclasses, do.call, list(nlines)), 
            stringsAsFactors=FALSE)
    names(df) <- colnames
    
    # Read the lines from the files
    i <- 0
    for (fn in names(lines)) {
      laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
      n   <- length(lines[[fn]])
      df[seq_len(n) + i, ] <- laf[lines[[fn]], ]
      i   <- i + n
    }
    
  • 2021-01-03 11:32

    Here is an approach using data.table, which lets you use fread (faster than read.csv) and rbindlist, a very fast implementation of do.call(rbind, list(..)) that is perfect for this situation. It also has a between function for range filtering:

    library(data.table)
    fileNames <- list.files(path = workDir, pattern = "\\.csv$")
    alldata <- rbindlist(lapply(fileNames, function(x, min, max) {
      xx <- fread(x, sep = ',')
      xx[, fileID := gsub(".csv.*", "", x)]   # tag each row with its source file
      xx[between(v3, lower = min, upper = max, incbounds = FALSE)]
    }, min = 2, max = 3))
    

    If the individual files are large and v3 always holds integer values, it might be worth setting v3 as a key and using a binary search. It may also be quicker to import everything first and then filter.
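    The keyed binary-search idea above can be sketched as follows. The toy table and its column names are assumptions standing in for one imported file.

    ```r
    library(data.table)

    # Toy data standing in for one imported file (names are assumptions)
    dt <- data.table(v1 = letters[1:10], v3 = 1:10)

    setkey(dt, v3)        # sorts the table by v3 and marks it as the key
    hits <- dt[J(3:6)]    # binary search on the key instead of a vector scan
    ```

    With the key set, `dt[J(3:6)]` joins on v3 via binary search, which can beat a full vector scan when the table is large.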

  • 2021-01-03 11:50

    If you want to do the filtering before importing the data, try read.csv.sql from the sqldf package.
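    A minimal sketch of that idea: read.csv.sql applies an SQL WHERE clause while reading, so only the matching rows ever reach R. The column name v3 and the bounds are assumptions carried over from the other answers, and the two csv files are written here only as demo data.

    ```r
    library(sqldf)

    # Demo setup: two tiny csv files (column names are assumptions)
    write.csv(data.frame(v1 = "a", v2 = "x", v3 = c(1, 4, 9)), "f1.csv", row.names = FALSE)
    write.csv(data.frame(v1 = "b", v2 = "y", v3 = c(3, 6, 8)), "f2.csv", row.names = FALSE)

    # Filter inside SQLite while importing; "file" refers to the csv being read
    fileNames <- list.files(pattern = "^f[12]\\.csv$")
    alldata <- do.call(rbind, lapply(fileNames, function(fn) {
      read.csv.sql(fn, sql = "select * from file where v3 > 2 and v3 < 7")
    }))
    ```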
