I have a large number of csv files that I want to read into R. All the column headings in the csvs are the same. But I want to import only those rows from each file into the
If you are really short on memory, the following solution might work. It uses LaF to read only the column needed for filtering, then calculates the total number of lines that will be read, initializes the complete data.frame, and finally reads the required lines from the files. (It's probably not faster than the other solutions.)
library("LaF")
colnames <- c("v1","v2","v3")
colclasses <- c("character", "character", "numeric")
# pattern is a regular expression, not a glob, so anchor on the extension
fileNames <- list.files(pattern = "\\.csv$")
# First determine which lines to read from each file and the total number of lines
# to be read
lines <- list()
for (fn in fileNames) {
  laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
  d <- laf$v3[]
  lines[[fn]] <- which(d > 2 & d < 7)
}
nlines <- sum(sapply(lines, length))
# Initialize data.frame
df <- as.data.frame(lapply(colclasses, do.call, list(nlines)),
                    stringsAsFactors=FALSE)
names(df) <- colnames
# Read the lines from the files
i <- 0
for (fn in names(lines)) {
  laf <- laf_open_csv(fn, column_types=colclasses, column_names=colnames, skip=1)
  n <- length(lines[[fn]])
  df[seq_len(n) + i, ] <- laf[lines[[fn]], ]
  i <- i + n
}
Here is an approach using data.table, which will allow you to use fread (which is faster than read.csv) and rbindlist, which is a superfast implementation of do.call(rbind, list(..)), perfect for this situation. It also has a function between.
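library(data.table)
# full.names = TRUE so fread() gets a usable path even if workDir is not the working directory
fileNames <- list.files(path = workDir, full.names = TRUE)
alldata <- rbindlist(lapply(fileNames, function(x, min, max) {
  xx <- fread(x, sep = ',')
  # tag each row with its source file (file name without the .csv extension)
  xx[, fileID := gsub("\\.csv.*", "", basename(x))]
  # keep rows with min < v3 < max (bounds excluded)
  xx[between(v3, lower = min, upper = max, incbounds = FALSE)]
}, min = 2, max = 3))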
If the individual files are large and v3 always takes integer values, it might be worth setting v3 as a key and then using a binary search. It may also be quicker to import everything and then run the filtering.
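A minimal sketch of the keyed binary search idea, assuming v3 holds integers and the wanted range is 3 to 6 (the toy table and the names xx, wanted and res are made up for illustration):

```r
library(data.table)

# Toy table standing in for one imported file (values are invented)
xx <- data.table(v1 = letters[1:8],
                 v2 = LETTERS[1:8],
                 v3 = c(1L, 5L, 3L, 7L, 2L, 6L, 4L, 3L))

# Sort by v3 and mark it as the key, which enables binary-search subsetting
setkey(xx, v3)

# Integer values strictly between 2 and 7
wanted <- 3:6

# Keyed join on v3; nomatch = 0L drops wanted values that do not occur in xx
res <- xx[J(wanted), nomatch = 0L]
```

This only works because the wanted values can be enumerated. For a non-integer v3, a vector scan with between is the simpler choice.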
If you want to do the "filtering" before importing the data, try read.csv.sql from the sqldf package.
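A rough sketch of how that could look, assuming a column named v3 and the same "greater than 2, less than 7" condition used above (the temporary file and its contents are made up for illustration):

```r
library(sqldf)

# Write a small example file (invented data, only for illustration)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(v1 = c("a", "b", "c", "d"),
                     v2 = c("w", "x", "y", "z"),
                     v3 = c(1, 3, 6, 8)),
          tmp, row.names = FALSE, quote = FALSE)

# Inside the SQL statement the input is always referred to as "file";
# only the rows satisfying the WHERE clause are returned to R
filtered <- read.csv.sql(tmp, sql = "select * from file where v3 > 2 and v3 < 7")
```

The file is loaded into a temporary SQLite database and filtered there, so only the matching rows end up in the R session. For many files, the same call could be wrapped in lapply over the file names and the results combined with rbind.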