I am trying to read a large CSV file into R. I only want to read and work with the rows that fulfil a particular condition (e.g. Variable2 >= 3); these rows form a much smaller dataset.
I want to read these lines directly into a dataframe, rather than load the whole dataset into a dataframe and then select according to the condition, since the whole dataset does not easily fit into memory.
You could use the read.csv.sql function in the sqldf package and filter using SQL select. From the help page of read.csv.sql:
library(sqldf)
write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
iris2 <- read.csv.sql("iris.csv",
sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
By far the easiest (in my book) is to use pre-processing.
R> DF <- data.frame(n=1:26, l=LETTERS)
R> write.csv(DF, file="/tmp/data.csv", row.names=FALSE)
R> read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
+ header=FALSE)
V1 V2
1 21 U
2 22 V
3 23 W
4 24 X
5 25 Y
6 26 Z
R>
Here we use awk. We tell awk to use a comma as the field separator, and then use the condition 'if the first field is greater than 20' to decide whether to print the whole line (via $0). The output from that command can be read by R via pipe().
This is going to be faster and more memory-efficient than reading everything into R.
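A small variant of the awk command (a sketch, not part of the original answer) lets the header line through as well, so read.csv can keep the original column names:
## print the first (header) line plus any data line whose first field exceeds 20;
## a bare pattern in awk defaults to printing the whole record
DF2 <- read.csv(pipe("awk 'BEGIN {FS=\",\"} NR == 1 || $1 > 20' /tmp/data.csv"))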
I was looking into readr::read_csv_chunked when I saw this question and thought I would do some benchmarking. For this example, read_csv_chunked does well, and increasing the chunk size was beneficial. sqldf was only marginally faster than awk.
library(tidyverse)
library(sqldf)
library(microbenchmark)
# Generate an example dataset with two numeric columns and 5 million rows
data_frame(
norm = rnorm(5e6, mean = 5000, sd = 1000),
unif = runif(5e6, min = 0, max = 10000)
) %>%
write_csv('medium.csv')
microbenchmark(
readr = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F),
readr2 = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F, chunk_size = 1000000),
sqldf = read.csv.sql('medium.csv', sql = 'select * from file where unif > 9000', eol = '\n'),
awk = read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv")),
awk2 = read_csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"), col_types = 'dd', progress = F),
check = function(values) all(sapply(values[-1], function(x) all.equal(values[[1]], x))),
times = 10L
)
# Unit: seconds
#    expr   min    lq  mean median    uq   max neval
#   readr  5.58  5.79  6.16   5.98  6.68  7.12    10
#  readr2  2.94  2.98  3.07   3.03  3.06  3.43    10
#   sqldf 13.59 13.74 14.20  13.91 14.64 15.49    10
#     awk 16.83 16.86 17.07  16.92 17.29 17.77    10
#    awk2 16.86 16.91 16.99  16.92 16.97 17.57    10
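For reference, outside of the benchmark wrapper the fastest readr call on its own looks like this (the chunk size and threshold are simply the values used above):
library(readr)
## each chunk is filtered by the callback; DataFrameCallback row-binds the pieces it returns
filtered <- read_csv_chunked(
  'medium.csv',
  callback   = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)),
  chunk_size = 1000000,
  col_types  = 'dd'
)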
You can read the file in chunks, process each chunk, and then stitch only the subsets together.
Here is a minimal example, assuming the file has 1001 lines (including the header) and only 100 lines fit into memory at a time. The data has 3 columns, and we expect at most 150 rows to meet the condition (this is needed so we can pre-allocate space for the final data):
# initialize empty data.frame (150 x 3)
max.rows <- 150
final.df <- data.frame(Variable1=rep(NA, max.rows),
                       Variable2=NA,
                       Variable3=NA)
# read the first chunk outside the loop
temp <- read.csv('big_file.csv', nrows=100, stringsAsFactors=FALSE)
temp <- temp[temp$Variable2 >= 3, ]  ## keep only rows meeting the condition
final.df[1:nrow(temp), ] <- temp     ## add to the data
last.row <- nrow(temp)               ## keep track of how many rows are filled
for (i in 1:9){ ## nine chunks remaining to be read
  temp <- read.csv('big_file.csv', skip=i*100+1, nrows=100, header=FALSE,
                   col.names=names(final.df), stringsAsFactors=FALSE)
  temp <- temp[temp$Variable2 >= 3, ]
  if (nrow(temp) > 0) { ## skip chunks with no matching rows
    final.df[(last.row+1):(last.row+nrow(temp)), ] <- temp
    last.row <- last.row + nrow(temp) ## increment the current count
  }
}
final.df <- final.df[1:last.row, ] ## only keep filled rows
rm(temp) ## remove last chunk to free memory
Edit: added the stringsAsFactors=FALSE option on @lucacerone's suggestion in the comments.
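If the total number of lines is not known up front, the same approach can be written with a loop that stops once read.csv runs out of data (a sketch, assuming the same file, column names, and condition as above):
max.rows   <- 150
final.df   <- data.frame(Variable1 = rep(NA, max.rows),
                         Variable2 = NA,
                         Variable3 = NA)
last.row   <- 0
chunk.size <- 100
skip       <- 1                             ## skip only the header line at first
repeat {
  temp <- tryCatch(
    read.csv('big_file.csv', skip = skip, nrows = chunk.size, header = FALSE,
             col.names = names(final.df), stringsAsFactors = FALSE),
    error = function(e) NULL                ## read.csv errors when no lines are left
  )
  if (is.null(temp) || nrow(temp) == 0) break
  temp <- temp[temp$Variable2 >= 3, ]
  if (nrow(temp) > 0) {
    final.df[(last.row + 1):(last.row + nrow(temp)), ] <- temp
    last.row <- last.row + nrow(temp)
  }
  skip <- skip + chunk.size
}
final.df <- final.df[seq_len(last.row), ]   ## keep only the filled rows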
You can open the file in read mode using the function file (e.g. fc = file("mydata.csv", open = "r")).
You can then read it one line at a time using the function readLines with option n = 1, e.g. l = readLines(fc, n = 1).
Then you have to parse each string using functions such as strsplit, regular expressions, or the package stringr (available from CRAN).
If the line meets the condition, you import the data.
To summarize, I would do something like this:
df = data.frame(var1=character(), var2=integer(), stringsAsFactors = FALSE)

fc = file("myfile.csv", open = "r")
i = 0
while (length(l <- readLines(fc, n = 1)) > 0){ # note: l is assigned inside the condition
  ## parse l here and check whether you need to import the data
  if (need_to_add_data){
    i = i + 1
    df[i,] = # list of data to import
  }
}
close(fc)
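To make the parsing step concrete, here is one way the placeholder parts might look (a sketch assuming the file has a header line, two comma-separated columns, and that the question's condition Variable2 >= 3 applies to the second one):
df <- data.frame(var1 = character(), var2 = numeric(), stringsAsFactors = FALSE)

fc <- file("myfile.csv", open = "r")
readLines(fc, n = 1)                        ## discard the header line
i <- 0
while (length(l <- readLines(fc, n = 1)) > 0) {
  fields <- strsplit(l, ",", fixed = TRUE)[[1]]
  if (as.numeric(fields[2]) >= 3) {         ## the question's condition: Variable2 >= 3
    i <- i + 1
    df[i, ] <- list(fields[1], as.numeric(fields[2]))
  }
}
close(fc)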
Source: https://stackoverflow.com/questions/23197243/how-to-read-only-lines-that-fulfil-a-condition-from-a-csv-into-r