Quickly reading very large tables as dataframes

清歌不尽 2020-11-21 04:46

I have very large tables (30 million rows) that I would like to load as data frames in R. read.table() has a lot of convenient features, but it seems like the

11 Answers
  • 2020-11-21 05:08

    One additional minor point worth mentioning: if you have a very large file, you can calculate the number of rows on the fly (assuming no header) using the following, where bedGraph holds the name of a file in your working directory:

    > numRow <- as.integer(system(paste("wc -l", bedGraph, "| sed 's/[^0-9.]*\\([0-9.]*\\).*/\\1/'"), intern = TRUE))
    

    You can then pass that value as nrows to read.csv, read.table ...

    > system.time(BG <- read.table(bedGraph, nrows = numRow, col.names = c("chr", "start", "end", "score"), colClasses = c("character", rep("integer", 3))))
       user  system elapsed 
     25.877   0.887  26.752 
    > object.size(BG)
    203949432 bytes
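
    If shelling out to wc is not an option (for example on Windows), the row count can also be obtained from within R. The following is only a base-R sketch and not part of the original answer: count_lines is a hypothetical helper, and the temporary demo file stands in for the real bedGraph file.

    ```r
    # Hypothetical demo file standing in for the real bedGraph file (no header).
    bedGraph <- tempfile(fileext = ".bedGraph")
    demo <- data.frame(chr = rep("chr1", 1000), start = 0:999, end = 1:1000,
                       score = rep(1:10, 100))
    write.table(demo, bedGraph, sep = "\t", quote = FALSE,
                row.names = FALSE, col.names = FALSE)

    # Hypothetical helper: count lines in chunks so the whole file is never in memory.
    count_lines <- function(path, chunk = 10000L) {
      con <- file(path, open = "r")
      on.exit(close(con))
      n <- 0L
      repeat {
        got <- length(readLines(con, n = chunk))
        if (got == 0L) break
        n <- n + got
      }
      n
    }

    numRow <- count_lines(bedGraph)

    # Same read.table call as above, with the row count computed in R.
    BG <- read.table(bedGraph, nrows = numRow,
                     col.names = c("chr", "start", "end", "score"),
                     colClasses = c("character", rep("integer", 3)))
    ```

    Counting lines this way still reads the file once, so it is mainly useful when the saving from a known nrows outweighs that extra pass.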
    
  • 2020-11-21 05:10

    An alternative is to use the vroom package, now on CRAN. vroom doesn't load the entire file; it indexes where each record is located, and the values are read later, when you use them.

    Only pay for what you use.

    See Introduction to vroom, Get started with vroom and the vroom benchmarks.

    The basic overview is that the initial read of a huge file will be much faster, and subsequent modifications to the data may be slightly slower. So depending on your use case, it could be the best option.

    See a simplified example from the vroom benchmarks below. The key things to notice are the very fast read time and the slightly slower later operations such as sample, filter and aggregate.

    package                 read    print   sample   filter  aggregate   total
    read.delim              1m      21.5s   1ms      315ms   764ms       1m 22.6s
    readr                   33.1s   90ms    2ms      202ms   825ms       34.2s
    data.table              15.7s   13ms    1ms      129ms   394ms       16.3s
    vroom (altrep) dplyr    1.7s    89ms    1.7s     1.3s    1.9s        6.7s
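
    As a rough illustration of that trade-off, a minimal vroom call might look like the sketch below. It assumes the vroom package is installed (the block does nothing otherwise), and the temporary CSV built from the built-in iris data is only a stand-in for a large file.

    ```r
    # Stand-in data: write the built-in iris data set to a temporary CSV file.
    path <- tempfile(fileext = ".csv")
    write.csv(iris, path, row.names = FALSE)

    n_rows <- NA_integer_
    if (requireNamespace("vroom", quietly = TRUE)) {
      # The initial call mostly indexes the file; column values are
      # materialised lazily when they are first accessed.
      df <- vroom::vroom(path, delim = ",")
      n_rows <- nrow(df)

      # Touching a column forces its values to be read.
      species_counts <- table(df$Species)
    }
    ```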
    
  • 2020-11-21 05:10

    I think it is often just good practice to keep larger datasets inside a database (e.g. Postgres). I don't use anything much larger than ncell = nrow * ncol = 10M, which is pretty small, but I often find I want R to create and hold memory-intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these memory problems will disappear. But keeping the data in a database, and using R's memory only for the query results and graphs, may still be useful. Some advantages are:

    (1) The data stays loaded in your database. You simply reconnect in pgAdmin to the databases you want when you turn your laptop back on.

    (2) It is true R can do many more nifty statistical and graphing operations than SQL. But I think SQL is better designed to query large amounts of data than R.

    # Looking at Voter/Registrant Age by Decade
    
    library(RPostgreSQL);library(lattice)
    
    con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
                     port="2345", host="localhost", dbname="WC2014_08_01_2014")
    
    Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
    
    Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")
    
    with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
    mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)
    
    with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
    mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)
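
    The same pattern (aggregate in SQL, pull only the small result into R) can be tried without a Postgres server. The sketch below swaps in an in-memory SQLite database through the DBI and RSQLite packages, which is an assumption and not what the answer above uses; the voters table and its birth_year column are made up for illustration.

    ```r
    # Hypothetical stand-in for the voter table.
    voters <- data.frame(precinctid = rep(1:4, times = c(10, 20, 30, 40)),
                         birth_year = rep(c(1975L, 1985L), 50))

    by_precinct <- NULL
    if (requireNamespace("DBI", quietly = TRUE) &&
        requireNamespace("RSQLite", quietly = TRUE)) {
      con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
      DBI::dbWriteTable(con, "voterdb", voters)

      # The database does the filtering and counting; R only receives one row per precinct.
      by_precinct <- DBI::dbGetQuery(con, "
        SELECT precinctid, COUNT(*) AS n
        FROM voterdb
        WHERE birth_year >= 1980
        GROUP BY precinctid
        ORDER BY n DESC")

      DBI::dbDisconnect(con)
    }
    ```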
    
  • 2020-11-21 05:11

    Strangely, no one answered the bottom part of the question for years even though this is an important one -- data.frames are simply lists with the right attributes, so if you have large data you don't want to use as.data.frame or similar for a list. It's much faster to simply "turn" a list into a data frame in-place:

    attr(df, "row.names") <- .set_row_names(length(df[[1]]))
    class(df) <- "data.frame"
    

    This makes no copy of the data so it's immediate (unlike all other methods). It assumes that you have already set names() on the list accordingly.

    [As for loading large data into R -- personally, I dump them by column into binary files and use readBin() - that is by far the fastest method (other than mmapping) and is only limited by the disk speed. Parsing ASCII files is inherently slow (even in C) compared to binary data.]
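
    A small base-R sketch of the column-wise binary approach, combined with the in-place conversion shown earlier, could look like this. The file layout (one .bin file per column in a temporary directory) and the column names are assumptions made for the example.

    ```r
    n <- 100000L
    cols <- list(id = seq_len(n), value = as.numeric(seq_len(n)) / 7)

    # Dump each column into its own binary file.
    bin_dir <- tempfile("bincols")
    dir.create(bin_dir)
    for (nm in names(cols)) {
      writeBin(cols[[nm]], file.path(bin_dir, paste0(nm, ".bin")))
    }

    # Read the columns back; the type and length must be known in advance.
    df <- list(
      id    = readBin(file.path(bin_dir, "id.bin"), what = "integer", n = n),
      value = readBin(file.path(bin_dir, "value.bin"), what = "double", n = n)
    )

    # Turn the named list into a data frame without copying the columns.
    attr(df, "row.names") <- .set_row_names(length(df[[1]]))
    class(df) <- "data.frame"
    ```

    Because readBin needs the type and element count of each column, that metadata has to be stored somewhere alongside the files in real use.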

  • 2020-11-21 05:15

    Instead of the conventional read.table, I find fread from the data.table package to be faster. Specifying additional arguments, such as select to read only the required columns, colClasses, and stringsAsFactors, reduces the time taken to import the file.

    library(data.table)
    # colClasses takes type names ("numeric", not "as.numeric"); the list maps each type to column numbers
    data_frame <- fread("filename.csv", sep = ",", header = FALSE, stringsAsFactors = FALSE, select = c(1, 4, 5, 6, 7),
                        colClasses = list(numeric = c(1, 5), character = 4, Date = 6, factor = 7))
    
  • 2020-11-21 05:16

    An update, several years later

    This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

    1. Using vroom from the tidyverse package vroom for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer.

    2. Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.

    3. Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).

    4. read.csv.raw from iotools provides another option for quickly reading CSV files.

    5. Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

    6. Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.


    The original answer

    There are a couple of simple things to try, whether you use read.table or scan.

    1. Set nrows=the number of records in your data (nmax in scan).

    2. Make sure that comment.char="" to turn off interpretation of comments.

    3. Explicitly define the classes of each column using colClasses in read.table.

    4. Setting multi.line=FALSE may also improve performance in scan.
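
    Put together, these settings might be applied as in the base-R sketch below. The demo file and its columns are made up, and inferring colClasses from the first 100 rows is an extra shortcut that assumes those rows are representative of the whole file.

    ```r
    # Hypothetical demo file with a header row.
    path <- tempfile(fileext = ".txt")
    n <- 5000L
    demo <- data.frame(id = seq_len(n),
                       group = rep_len(letters, n),
                       x = seq_len(n) + 0.5)
    write.table(demo, path, quote = FALSE, row.names = FALSE)

    # Infer the column classes from a small sample of rows.
    sample_rows <- read.table(path, header = TRUE, nrows = 100, stringsAsFactors = FALSE)
    classes <- vapply(sample_rows, function(col) class(col)[1], character(1))

    # read.table with nrows, comment.char and colClasses all set explicitly.
    full <- read.table(path, header = TRUE, nrows = n,
                       comment.char = "", colClasses = classes)

    # The scan equivalent: nmax limits the number of records, multi.line = FALSE
    # tells scan that each record sits on a single line.
    cols <- scan(path, what = list(id = 0L, group = "", x = 0), skip = 1,
                 nmax = n, multi.line = FALSE, quiet = TRUE)
    ```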

    If none of these things work, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

    The other alternative is filtering your data before you read it into R.

    Or, if the problem is that you have to read it in regularly, then use these methods to read the data in once, save the data frame as a binary blob with saveRDS, and next time retrieve it faster with readRDS.
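
    A minimal round trip with saveRDS and readRDS, using a small made-up data frame and a temporary file path, might look like this:

    ```r
    # Stand-in for a data frame that was expensive to parse from text.
    df <- data.frame(id = 1:1000, x = (1:1000) / 10)

    # Save once in R's binary serialization format ...
    rds_path <- tempfile(fileext = ".rds")
    saveRDS(df, rds_path)

    # ... and restore it in later sessions without re-parsing any text.
    restored <- readRDS(rds_path)
    ```

    Unlike save and load, saveRDS stores a single object without its name, so the result of readRDS can be assigned to any variable.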
