read.csv is extremely slow in reading csv files with large numbers of columns

栀梦 2020-12-08 11:29

I have a .csv file: example.csv with 8000 columns x 40000 rows. The csv file has a string header for each column. All fields contain integer values between 0 and 10. When I try to load this file with read.csv, it is extremely slow.

5 Answers
  • 2020-12-08 11:53

    Try using fread from the data.table package. This is by far one of the fastest ways to read .csv files into R. There is a good benchmark here.

    library(data.table)
    
    data <- fread("c:/data.csv")
    

    If you want to make it even faster, you can also read only the subset of columns you want to use:

    data <- fread("c:/data.csv", select = c("col1", "col2", "col3"))
    
  • 2020-12-08 12:05

    If you'll read the file often, it might well be worth saving it from R in a binary format using the save function. Specifying compress=FALSE often results in faster load times.

    ...You can then load it in with the (surprise!) load function.

    d <- as.data.frame(matrix(1:1e6,ncol=1000))
    write.csv(d, "c:/foo.csv", row.names=FALSE)
    
    # Load file with read.csv
    system.time( a <- read.csv("c:/foo.csv") ) # 3.18 sec
    
    # Load file using scan
    system.time( b <- matrix(scan("c:/foo.csv", 0L, skip=1, sep=','), 
                             ncol=1000, byrow=TRUE) ) # 0.55 sec
    
    # Load (binary) file using load
    save(d, file="c:/foo.bin", compress=FALSE)
    system.time( load("c:/foo.bin") ) # 0.09 sec
    
  • 2020-12-08 12:08

    It might be worth trying the new vroom package.

    vroom is a new approach to reading delimited and fixed width data into R.

    It stems from the observation that when parsing files, reading data from disk and finding the delimiters is generally not the main bottleneck. Instead, (re)-allocating memory and parsing the values into R data types (particularly for characters) takes the bulk of the time.

    Therefore you can obtain very rapid input by first performing a fast indexing step and then using the ALTREP (ALTernative REPresentations) framework available in R versions 3.5+ to access the values in a lazy / delayed fashion.

    This approach potentially also allows you to work with data that is larger than memory. As long as you are careful to avoid materializing the entire dataset at once, it can be efficiently queried and subset.

    #install.packages("vroom", 
    #                 dependencies = TRUE, repos = "https://cran.rstudio.com")
    library(vroom)
    
    df <- vroom('example.csv')
    
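    If you only need a handful of the 8000 columns, vroom's col_select argument lets you materialize just those. A minimal sketch (the column names below are placeholders, not from the question):

    # read only the named columns; other columns are never materialized
    df_subset <- vroom('example.csv', col_select = c(col1, col2, col3))
    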

    Benchmark: readr vs data.table vs vroom for a 1.57GB file

  • 2020-12-08 12:11

    Also try Hadley Wickham's readr package:

    library(readr) 
    data <- read_csv("file.csv")
    
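    Since every field in this file is an integer, you can also skip readr's type guessing by declaring the column types up front. A sketch, assuming all columns should be read as integers:

    # .default sets the type for every column not listed explicitly
    data <- read_csv("file.csv",
                     col_types = cols(.default = col_integer()))
    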
  • 2020-12-08 12:16

    If your CSV only contains integers, you should use scan instead of read.csv, since ?read.csv says:

     ‘read.table’ is not the right tool for reading large matrices,
     especially those with many columns: it is designed to read _data
     frames_ which may have columns of very different classes.  Use
     ‘scan’ instead for matrices.
    

    Since your file has a header, you will need skip=1, and it will probably be faster if you set what=integer(). If you must use read.csv and speed / memory consumption are a concern, setting the colClasses argument is a huge help.
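    For example, a rough sketch assuming the layout from the question (one header row, comma-separated, 8000 all-integer columns x 40000 rows):

    # scan: skip the header row and read every field as an integer
    m <- matrix(scan("example.csv", what = integer(), skip = 1, sep = ","),
                nrow = 40000, byrow = TRUE)
    
    # read.csv: declare every column as integer so types are not guessed per column
    d <- read.csv("example.csv", colClasses = "integer")
    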
