Filtering multiple csv files while importing into data frame

Asked 2021-01-03 11:18

I have a large number of csv files that I want to read into R. All the column headings in the csvs are the same. But I want to import only those rows from each file into the data frame that satisfy a condition on one of the columns (here, rows whose value of v3 falls inside a given range).
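For reference, the straightforward approach that the answers improve on might look like the sketch below: read each file in full, filter, then bind. The column name v3 and the 2..7 bounds are illustrative assumptions taken from the answers, and the two csv files are written here only as demo data.

```r
# Demo setup: two small csv files (column name v3 and the 2..7 bounds
# are illustrative, taken from the answers)
write.csv(data.frame(v1 = "a", v2 = "x", v3 = c(1, 4, 9)), "a1.csv", row.names = FALSE)
write.csv(data.frame(v1 = "b", v2 = "y", v3 = c(3, 6, 8)), "a2.csv", row.names = FALSE)

# Read each file fully, then keep only the rows inside the range
fileNames <- list.files(pattern = "^a[12]\\.csv$")
alldata <- do.call(rbind, lapply(fileNames, function(fn) {
  d <- read.csv(fn, stringsAsFactors = FALSE)
  d[d$v3 > 2 & d$v3 < 7, ]
}))
```

This works but reads every file whole before filtering, which is exactly the memory cost the answers below try to avoid.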

3 Answers
  • 2021-01-03 11:30

    If you are really stuck for memory, then the following solution might work. It uses LaF to read only the column needed for filtering; then calculates the total number of lines that will be read; initializes the complete data.frame; and then reads the required lines from the files. (It is probably not faster than the other solutions.)

    library("LaF")
    
    colnames <- c("v1","v2","v3")
    colclasses <- c("character", "character", "numeric")
    
    fileNames <- list.files(pattern = "\\.csv$")
    
    # First determine which lines to read from each file and the total number of lines
    # to be read
    lines <- list()
    for (fn in fileNames) {
      laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
      d   <- laf$v3[] 
      lines[[fn]] <- which(d > 2 & d < 7)
    }
    nlines <- sum(sapply(lines, length))
    
    # Initialize data.frame
    df <- as.data.frame(lapply(colclasses, do.call, list(nlines)), 
            stringsAsFactors=FALSE)
    names(df) <- colnames
    
    # Read the lines from the files
    i <- 0
    for (fn in names(lines)) {
      laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
      n   <- length(lines[[fn]])
      df[seq_len(n) + i, ] <- laf[lines[[fn]], ]
      i   <- i + n
    }
    
  • 2021-01-03 11:32

    Here is an approach using data.table, which lets you use fread (faster than read.csv) and rbindlist, a very fast implementation of do.call(rbind, list(..)) that is perfect for this situation. It also has a between function for range filtering:

    library(data.table)
    fileNames <- list.files(path = workDir, pattern = "\\.csv$")
    alldata <- rbindlist(lapply(fileNames, function(x, min, max) {
      xx <- fread(x, sep = ',')
      xx[, fileID := gsub(".csv.*", "", x)]   # tag each row with its source file
      xx[between(v3, lower = min, upper = max, incbounds = FALSE)]
    }, min = 2, max = 3))
    

    If the individual files are large and v3 always holds integer values, it might be worth setting v3 as a key and using a binary search. It may also be quicker to import everything first and then filter.
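    The keyed binary-search idea above can be sketched as follows. The toy table and its column names are assumptions standing in for one imported file.

    ```r
    library(data.table)

    # Toy data standing in for one imported file (names are assumptions)
    dt <- data.table(v1 = letters[1:10], v3 = 1:10)

    setkey(dt, v3)        # sorts the table by v3 and marks it as the key
    hits <- dt[J(3:6)]    # binary search on the key instead of a vector scan
    ```

    With the key set, `dt[J(3:6)]` joins on v3 via binary search, which can beat a full vector scan when the table is large.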

  • 2021-01-03 11:50

    If you want to do the filtering before importing the data, try read.csv.sql from the sqldf package.
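    A minimal sketch of that idea: read.csv.sql applies an SQL WHERE clause while reading, so only the matching rows ever reach R. The column name v3 and the bounds are assumptions carried over from the other answers, and the two csv files are written here only as demo data.

    ```r
    library(sqldf)

    # Demo setup: two tiny csv files (column names are assumptions)
    write.csv(data.frame(v1 = "a", v2 = "x", v3 = c(1, 4, 9)), "f1.csv", row.names = FALSE)
    write.csv(data.frame(v1 = "b", v2 = "y", v3 = c(3, 6, 8)), "f2.csv", row.names = FALSE)

    # Filter inside SQLite while importing; "file" refers to the csv being read
    fileNames <- list.files(pattern = "^f[12]\\.csv$")
    alldata <- do.call(rbind, lapply(fileNames, function(fn) {
      read.csv.sql(fn, sql = "select * from file where v3 > 2 and v3 < 7")
    }))
    ```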
