Quickly reading very large tables as dataframes

清歌不尽 2020-11-21 04:46

I have very large tables (30 million rows) that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down.

11 Answers
  • 2020-11-21 05:19

    Here is an example that uses fread from data.table 1.8.7.

    The examples come from the help page for fread, with timings on my Windows XP Core 2 Duo E8400.

    library(data.table)
    # Demo speedup
    n=1e6
    DT = data.table( a=sample(1:1000,n,replace=TRUE),
                     b=sample(1:1000,n,replace=TRUE),
                     c=rnorm(n),
                     d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
                     e=rnorm(n),
                     f=sample(1:1000,n,replace=TRUE) )
    DT[2,b:=NA_integer_]
    DT[4,c:=NA_real_]
    DT[3,d:=NA_character_]
    DT[5,d:=""]
    DT[2,e:=+Inf]
    DT[3,e:=-Inf]
    

    standard read.table

    write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
    cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")    
    ## File size (MB): 51 
    
    system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
    ##    user  system elapsed 
    ##   24.71    0.15   25.42
    # second run will be faster
    system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
    ##    user  system elapsed 
    ##   17.85    0.07   17.98
    

    optimized read.table

    system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",  
                              stringsAsFactors=FALSE,comment.char="",nrows=n,                   
                              colClasses=c("integer","integer","numeric",                        
                                           "character","numeric","integer")))
    
    
    ##    user  system elapsed 
    ##   10.20    0.03   10.32
    

    fread

    require(data.table)
    system.time(DT <- fread("test.csv"))                                  
    ##    user  system elapsed 
    ##    3.12    0.01    3.22
    

    sqldf

    require(sqldf)
    
    system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))             
    
    ##    user  system elapsed 
    ##   12.49    0.09   12.69
    
    # sqldf as on SO
    
    f <- file("test.csv")
    system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
    
    ##    user  system elapsed 
    ##   10.21    0.47   10.73
    

    ff / ffdf

     require(ff)
    
     system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))   
     ##    user  system elapsed 
     ##   10.85    0.10   10.99
    

    In summary:

    ##    user  system elapsed  Method
    ##   24.71    0.15   25.42  read.csv (first time)
    ##   17.85    0.07   17.98  read.csv (second time)
    ##   10.20    0.03   10.32  Optimized read.table
    ##    3.12    0.01    3.22  fread
    ##   12.49    0.09   12.69  sqldf
    ##   10.21    0.47   10.73  sqldf on SO
    ##   10.85    0.10   10.99  ffdf
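
    For reference, a rough sketch of passing explicit options to fread instead of relying on its auto-detection (the options shown are standard fread arguments, but which ones actually help depends on your file; this assumes a reasonably current data.table):

    library(data.table)
    DT <- fread("test.csv",
                sep = ",",                      # fread normally auto-detects the separator
                na.strings = c("", "NA"),       # treat empty fields as NA
                colClasses = list(integer   = c("a", "b", "f"),
                                  numeric   = c("c", "e"),
                                  character = "d"))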
    
  • 2020-11-21 05:24

    This was previously asked on R-Help, so that's worth reviewing.

    One suggestion there was to use readChar() and then do the string manipulation on the result yourself with strsplit() and substr(). There is far less logic in readChar than in read.table, which is part of why it can be faster (a rough sketch follows).
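
    As a rough illustration of that idea, here is a minimal sketch; the file name and the assumption of a clean tab-delimited file with no quoted fields are mine, not from the R-Help thread:

    # read the whole file as one string, then split it into lines and fields
    path  <- "test.tsv"                                    # hypothetical file
    raw   <- readChar(path, file.info(path)$size, useBytes = TRUE)
    lines <- strsplit(raw, "\n", fixed = TRUE)[[1]]
    cells <- strsplit(lines, "\t", fixed = TRUE)
    DF    <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
    # column types still have to be converted by hand, e.g. with as.numeric()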

    I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, a MapReduce framework designed for dealing with large data sets, and its hsTableReader function. Here is an example (but note there is a learning curve for Hadoop itself):

    str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
    cat(str)
    cols = list(key='',val=0)
    con <- textConnection(str, open = "r")
    hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
    close(con)
    

    The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the import in parallel by segmenting the file. For large data sets, though, that is unlikely to help, because you will run into memory constraints, which is why map-reduce is a better approach. A plain base-R version of the chunking idea is sketched below.
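
    For reference, a minimal sketch of the chunking idea in base R, without Hadoop (the file name and chunk size are illustrative, and a header row is assumed absent for simplicity):

    con <- file("big.tsv", open = "r")            # hypothetical file
    chunk_size <- 100000L
    repeat {
      chunk <- tryCatch(
        read.table(con, nrows = chunk_size, sep = "\t",
                   header = FALSE, stringsAsFactors = FALSE),
        error = function(e) NULL)                 # read.table errors once the connection is exhausted
      if (is.null(chunk)) break
      # ... process or aggregate the chunk here instead of keeping everything in memory ...
      if (nrow(chunk) < chunk_size) break         # last, partial chunk
    }
    close(con)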

  • 2020-11-21 05:30

    I am reading data very quickly using the new arrow package. It appears to be at a fairly early stage.

    Specifically, I am using the Parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you skip that conversion. The format is also convenient because it can be read from Python as well.

    My main use case for this is on a fairly resource-constrained RShiny server. For that reason, I prefer to keep the data attached to the apps (i.e., out of SQL), and therefore need small file sizes as well as speed.

    This linked article provides benchmarking and a good overview. I have quoted some interesting points below.

    https://ursalabs.org/blog/2019-10-columnar-perf/

    File Size

    That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.

    Read Speed

    When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see that the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.


    Independent Test

    I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically, I shuffled a bunch of things around to try to challenge the compression. I also added a short text field of random words and two simulated factors.

    Data

    library(dplyr)
    library(tibble)
    library(OpenRepGrid)   # provides randomSentences()
    
    n <- 1000000
    
    set.seed(1234)
    # two pools of random factor levels (10 short codes, 65 longer codes)
    some_levels1 <- sapply(1:10, function(x)
      paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
    some_levels2 <- sapply(1:65, function(x)
      paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))
    
    # resample mtcars up to n rows, shuffle every column, then add the
    # simulated factor columns and a random text field
    test_data <- mtcars %>%
      rownames_to_column() %>%
      sample_n(n, replace = TRUE) %>%
      mutate_all(~ sample(., length(.))) %>%
      mutate(factor1 = sample(some_levels1, n, replace = TRUE),
             factor2 = sample(some_levels2, n, replace = TRUE),
             text = randomSentences(n, sample(3:8, n, replace = TRUE))
             )
    
    

    Read and Write

    Writing the data is easy.

    library(arrow)
    
    write_parquet(test_data , "test_data.parquet")
    
    # you can also mess with the compression
    write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)
    

    Reading the data is also easy.

    read_parquet("test_data.parquet")
    
    # this option will result in lightning fast reads, but in a different format.
    read_parquet("test_data2.parquet", as_data_frame = FALSE)
    

    I tested reading this data against a few of the competing options and got slightly different results than the article above, which is expected.

    This file is nowhere near as large as the ones in the benchmark article, so maybe that is the difference.

    Tests

    • rds: test_data.rds (20.3 MB)
    • parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
    • parquet2: test_data2.parquet (14.9 MB with higher compression)
    • parquet: test_data.parquet (40.7 MB)
    • fst2: test_data2.fst (27.9 MB with higher compression)
    • fst: test_data.fst (76.8 MB)
    • fread2: test_data.csv.gz (23.6 MB)
    • fread: test_data.csv (98.7 MB)
    • feather_arrow: test_data.feather (157.2 MB read with arrow)
    • feather: test_data.feather (157.2 MB read with feather)

    Observations

    For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed-up.

    fst is also a great choice here. I would use either the highly compressed fst format or the highly compressed parquet, depending on whether I needed speed or a smaller file size. A sketch of the fst calls follows.
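
    For completeness, a minimal sketch of the fst calls behind the fst and fst2 entries above (treating compress = 100 as the "higher compression" setting is my assumption):

    library(fst)
    write_fst(test_data, "test_data.fst")                    # default compression ("fst" above)
    write_fst(test_data, "test_data2.fst", compress = 100)   # maximum compression ("fst2" above)
    system.time(FST <- read_fst("test_data2.fst"))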

  • 2020-11-21 05:30

    I've tried all of the above and readr did the best job. I have only 8 GB of RAM.

    Looping over 20 files, 5 GB each, 7 columns (a fuller sketch of the loop follows the call below):

    library(readr)
    read_fwf(arquivos[i], col_types = "ccccccc",
             fwf_cols(cnpj = c(4, 17), nome = c(19, 168), cpf = c(169, 183),
                      fantasia = c(169, 223), sit.cadastral = c(224, 225),
                      dt.sitcadastral = c(226, 233), cnae = c(376, 382)))
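
    A hedged sketch of how that loop can process the files one at a time, so that 20 × 5 GB never has to sit in memory at once (arquivos is assumed to be the character vector of file paths, and the processing step is a placeholder):

    library(readr)
    for (i in seq_along(arquivos)) {
      dados <- read_fwf(arquivos[i], col_types = "ccccccc",
                        fwf_cols(cnpj = c(4, 17), nome = c(19, 168), cpf = c(169, 183),
                                 fantasia = c(169, 223), sit.cadastral = c(224, 225),
                                 dt.sitcadastral = c(226, 233), cnae = c(376, 382)))
      # filter/summarise dados and write the reduced result out before reading the next file
      rm(dados); gc()
    }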
    
  • 2020-11-21 05:31

    I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.

    There's been a little bit of discussion about the best way to import 2 GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then pulling it from SQLite into R. This works really well for me. I was able to pull in 2 GB (3 columns, 40 million rows) of data in under 5 minutes. By contrast, the read.csv command ran all night and never completed.

    Here's my test code:

    Set up the test data:

    bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
    write.csv(bigdf, 'bigdf.csv', quote = F)
    

    I restarted R before running the following import routine:

    library(sqldf)
    f <- file("bigdf.csv")
    system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
    

    I let the following line run all night but it never completed:

    system.time(big.df <- read.csv('bigdf.csv'))
    