How to read only lines that fulfil a condition from a csv into R?

猫巷女王i 2020-11-27 16:03

I am trying to read a large csv file into R. I only want to read and work with some of the rows that fulfil a particular condition (e.g. Variable2 >= 3). This subset is much smaller than the full file, so I would like to read only the matching rows into a data frame rather than load the whole file into memory and filter it afterwards.

5 Answers
  • 2020-11-27 16:39

    You can read the file in chunks, process each chunk, and then stitch only the subsets together.

    Here is a minimal example assuming the file has 1001 lines (incl. the header) and only 100 will fit into memory at a time. The data has 3 columns, and we expect at most 150 rows to meet the condition (this is needed to pre-allocate space for the final data frame):

    # initialize empty data.frame (150 x 3)
    max.rows <- 150
    final.df <- data.frame(Variable1=rep(NA, max.rows), 
                           Variable2=NA,  
                           Variable3=NA)
    
    # read the first chunk outside the loop
    temp <- read.csv('big_file.csv', nrows=100, stringsAsFactors=FALSE)
    temp <- temp[temp$Variable2 >= 3, ]  ## keep only the rows meeting the condition
    final.df[1:nrow(temp), ] <- temp     ## add to the data
    last.row = nrow(temp)                ## keep track of the last filled row
    
    for (i in 1:9){    ## nine chunks remaining to be read
      temp <- read.csv('big_file.csv', skip=i*100+1, nrows=100, header=FALSE,
                       col.names=names(final.df),   ## header=FALSE, so restore the names
                       stringsAsFactors=FALSE)
      temp <- temp[temp$Variable2 >= 3, ]
      if (nrow(temp) > 0) {                  ## skip chunks with no matching rows
        final.df[(last.row+1):(last.row+nrow(temp)), ] <- temp
        last.row <- last.row + nrow(temp)    ## increment the current count
      }
    }
    
    final.df <- final.df[1:last.row, ]   ## only keep filled rows
    rm(temp)    ## remove last chunk to free memory
    

    Edit: Added stringsAsFactors=FALSE option on @lucacerone's suggestion in the comments.
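
    The example above hard-codes the number of chunks. If the total number of lines is not known in advance, a sketch of the same idea (assuming the same file name, column names, and condition as above) keeps reading fixed-size chunks until the file is exhausted and binds the matching rows at the end:

    chunk.size <- 100
    col.names  <- c('Variable1', 'Variable2', 'Variable3')
    skip   <- 1            ## skip the header line; every chunk is read without one
    pieces <- list()
    repeat {
      temp <- tryCatch(
        read.csv('big_file.csv', skip=skip, nrows=chunk.size, header=FALSE,
                 col.names=col.names, stringsAsFactors=FALSE),
        error = function(e) NULL)            ## read.csv errors once the file is exhausted
      if (is.null(temp)) break
      pieces[[length(pieces) + 1]] <- temp[temp$Variable2 >= 3, ]
      skip <- skip + nrow(temp)
      if (nrow(temp) < chunk.size) break     ## a short chunk means we reached the end
    }
    final.df <- do.call(rbind, pieces)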

  • 2020-11-27 16:40

    I was looking into readr::read_csv_chunked when I saw this question and thought I would do some benchmarking. For this example, read_csv_chunked does well and increasing the chunk size was beneficial. sqldf was only marginally faster than awk.

    library(tidyverse)
    library(sqldf)
    library(data.table)
    library(microbenchmark)
    
    # Generate an example dataset with two numeric columns and 5 million rows
    tibble(
      norm = rnorm(5e6, mean = 5000, sd = 1000),
      unif = runif(5e6, min = 0, max = 10000)
    ) %>%
      write_csv('medium.csv')
    
    microbenchmark(
      readr  = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F),
      readr2 = read_csv_chunked('medium.csv', callback = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)), col_types = 'dd', progress = F, chunk_size = 1000000),
      sqldf  = read.csv.sql('medium.csv', sql = 'select * from file where unif > 9000', eol = '\n'),
      awk    = read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv")),
      awk2   = read_csv(pipe("awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"), col_types = 'dd', progress = F),
      fread  = fread(cmd = "awk 'BEGIN {FS=\",\"} {if ($2 > 9000) print $0}' medium.csv"),
      check  = function(values) all(sapply(values[-1], function(x) all.equal(values[[1]], x))),
      times  = 10L
    )
    
    # Updated 2020-05-29
    
    # Unit: seconds
    #   expr   min    lq  mean  median    uq   max neval
    #  readr   2.6   2.7   3.1     3.1   3.5   4.0    10
    # readr2   2.3   2.3   2.4     2.4   2.6   2.7    10
    #  sqldf  14.1  14.1  14.7    14.3  15.2  16.0    10
    #    awk  18.2  18.3  18.7    18.5  19.3  19.6    10
    #   awk2  18.1  18.2  18.6    18.4  19.1  19.4    10
    #  fread  17.9  18.0  18.2    18.1  18.2  18.8    10
    
    # R version 3.6.2 (2019-12-12)
    # macOS Mojave 10.14.6        
    
    # data.table 1.12.8
    # readr      1.3.1 
    # sqldf      0.4-11
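
    For reference, a minimal standalone call (outside the benchmark harness) that returns only the filtered rows as one data frame looks like this; the file name, condition, and chunk size are simply the ones used above:

    library(readr)
    
    ## DataFrameCallback row-binds whatever the callback returns for each chunk,
    ## so only the rows with unif > 9000 accumulate in memory.
    filtered <- read_csv_chunked(
      'medium.csv',
      callback   = DataFrameCallback$new(function(x, pos) subset(x, unif > 9000)),
      col_types  = 'dd',
      chunk_size = 1000000,
      progress   = FALSE
    )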
    
  • 2020-11-27 16:41

    You can open the file in read mode using the function file (e.g. file("mydata.csv", open = "r")).

    You can read the file one line at a time using the function readLines with the option n = 1, e.g. l = readLines(fc, n = 1).

    Then you parse the string using functions such as strsplit or regular expressions, or you can try the package stringr (available from CRAN).

    If the line meets the condition, you import it.

    To summarize I would do something like this:

    # empty data.frame with the right column types (integer(), not int())
    df = data.frame(var1=character(), var2=integer(), stringsAsFactors = FALSE)
    fc = file("myfile.csv", open = "r")
    readLines(fc, n = 1)   # discard the header line
    
    i = 0
    while (length(l <- readLines(fc, n = 1)) > 0){ # the assignment has to sit inside length()
    
       ## parse l and check whether you need to import the data,
       ## e.g. with the question's condition Variable2 >= 3 on the second field
       fields = strsplit(l, ",", fixed = TRUE)[[1]]
    
       if (length(fields) >= 2 && as.numeric(fields[2]) >= 3){
         i = i + 1
         df[i, ] = list(fields[1], as.integer(fields[2]))
       }
    
    }
    close(fc)
    
  • 2020-11-27 16:43

    By far the easiest (in my book) is to use pre-processing.

    R> DF <- data.frame(n=1:26, l=LETTERS)
    R> write.csv(DF, file="/tmp/data.csv", row.names=FALSE)
    R> read.csv(pipe("awk 'BEGIN {FS=\",\"} {if ($1 > 20) print $0}' /tmp/data.csv"),
    +           header=FALSE)
      V1 V2
    1 21  U
    2 22  V
    3 23  W
    4 24  X
    5 25  Y
    6 26  Z
    R> 
    

    Here we use awk. We tell awk to use a comma as the field separator, and then use the condition 'if the first field is greater than 20' to decide whether to print (the whole line via $0).

    The output from that command can be read by R via pipe().

    This is going to be faster and more memory-efficient than reading everything into R first.
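
    Applied to the question's condition (Variable2 >= 3, assuming it is the second column; the file and column names below are placeholders), the same pattern would look roughly like:

    df <- read.csv(pipe("awk 'BEGIN {FS=\",\"} NR > 1 && $2 >= 3 {print $0}' mydata.csv"),
                   header=FALSE)
    names(df) <- c("Variable1", "Variable2", "Variable3")

    The NR > 1 clause drops the header line before the filter is applied, which is why header=FALSE and the explicit names() call are needed.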

  • 2020-11-27 16:51

    You could use the read.csv.sql function in the sqldf package and filter with an SQL select statement. From the help page of read.csv.sql:

    library(sqldf)
    write.csv(iris, "iris.csv", quote = FALSE, row.names = FALSE)
    iris2 <- read.csv.sql("iris.csv", 
        sql = "select * from file where `Sepal.Length` > 5", eol = "\n")
    