Quickly reading very large tables as dataframes

清歌不尽 2020-11-21 04:46

I have very large tables (30 million rows) that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down.

11 Answers
  • 2020-11-21 05:19

    Here is an example that uses fread from data.table 1.8.7.

    The examples come from the help page for fread, with timings on my Windows XP Core 2 Duo E8400.

    library(data.table)
    # Demo speedup
    n=1e6
    DT = data.table( a=sample(1:1000,n,replace=TRUE),
                     b=sample(1:1000,n,replace=TRUE),
                     c=rnorm(n),
                     d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
                     e=rnorm(n),
                     f=sample(1:1000,n,replace=TRUE) )
    DT[2,b:=NA_integer_]
    DT[4,c:=NA_real_]
    DT[3,d:=NA_character_]
    DT[5,d:=""]
    DT[2,e:=+Inf]
    DT[3,e:=-Inf]
    

    standard read.table

    write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
    cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")    
    ## File size (MB): 51 
    
    system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
    ##    user  system elapsed 
    ##   24.71    0.15   25.42
    # second run will be faster
    system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
    ##    user  system elapsed 
    ##   17.85    0.07   17.98
    

    optimized read.table

    system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",  
                              stringsAsFactors=FALSE,comment.char="",nrows=n,                   
                              colClasses=c("integer","integer","numeric",                        
                                           "character","numeric","integer")))
    
    
    ##    user  system elapsed 
    ##   10.20    0.03   10.32
    

    fread

    require(data.table)
    system.time(DT <- fread("test.csv"))                                  
    ##    user  system elapsed 
    ##    3.12    0.01    3.22
    

    sqldf

    require(sqldf)
    
    system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))             
    
    ##    user  system elapsed 
    ##   12.49    0.09   12.69
    
    # sqldf as on SO
    
    f <- file("test.csv")
    system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
    
    ##    user  system elapsed 
    ##   10.21    0.47   10.73
    

    ff / ffdf

     require(ff)
    
     system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))   
     ##    user  system elapsed 
     ##   10.85    0.10   10.99
    

    In summary:

    ##    user  system elapsed  Method
    ##   24.71    0.15   25.42  read.csv (first time)
    ##   17.85    0.07   17.98  read.csv (second time)
    ##   10.20    0.03   10.32  Optimized read.table
    ##    3.12    0.01    3.22  fread
    ##   12.49    0.09   12.69  sqldf
    ##   10.21    0.47   10.73  sqldf on SO
    ##   10.85    0.10   10.99  ffdf
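
    For reference, a rough sketch of passing explicit options to fread instead of relying on its auto-detection (the options shown are standard fread arguments, but which ones actually help depends on your file; this assumes a reasonably current data.table):

    library(data.table)
    DT <- fread("test.csv",
                sep = ",",                      # fread normally auto-detects the separator
                na.strings = c("", "NA"),       # treat empty fields as NA
                colClasses = list(integer   = c("a", "b", "f"),
                                  numeric   = c("c", "e"),
                                  character = "d"))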
    
  • 2020-11-21 05:24

    This was previously asked on R-Help, so that's worth reviewing.

    One suggestion there was to use readChar() and then do the string manipulation on the result yourself with strsplit() and substr(). There is far less logic in readChar than in read.table, which is part of why it can be faster (a rough sketch follows).
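
    As a rough illustration of that idea, here is a minimal sketch; the file name and the assumption of a clean tab-delimited file with no quoted fields are mine, not from the R-Help thread:

    # read the whole file as one string, then split it into lines and fields
    path  <- "test.tsv"                                    # hypothetical file
    raw   <- readChar(path, file.info(path)$size, useBytes = TRUE)
    lines <- strsplit(raw, "\n", fixed = TRUE)[[1]]
    cells <- strsplit(lines, "\t", fixed = TRUE)
    DF    <- as.data.frame(do.call(rbind, cells), stringsAsFactors = FALSE)
    # column types still have to be converted by hand, e.g. with as.numeric()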

    I don't know if memory is an issue here, but you might also want to take a look at the HadoopStreaming package. This uses Hadoop, a MapReduce framework designed for dealing with large data sets, and its hsTableReader function. Here is an example (but note there is a learning curve for Hadoop itself):

    str <- "key1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey1\t3.9\nkey1\t8.9\nkey1\t1.2\nkey2\t9.9\nkey2\"
    cat(str)
    cols = list(key='',val=0)
    con <- textConnection(str, open = "r")
    hsTableReader(con,cols,chunkSize=6,FUN=print,ignoreKey=TRUE)
    close(con)
    

    The basic idea here is to break the data import into chunks. You could even go so far as to use one of the parallel frameworks (e.g. snow) and run the import in parallel by segmenting the file. For large data sets, though, that is unlikely to help, because you will run into memory constraints, which is why map-reduce is a better approach. A plain base-R version of the chunking idea is sketched below.
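
    For reference, a minimal sketch of the chunking idea in base R, without Hadoop (the file name and chunk size are illustrative, and a header row is assumed absent for simplicity):

    con <- file("big.tsv", open = "r")            # hypothetical file
    chunk_size <- 100000L
    repeat {
      chunk <- tryCatch(
        read.table(con, nrows = chunk_size, sep = "\t",
                   header = FALSE, stringsAsFactors = FALSE),
        error = function(e) NULL)                 # read.table errors once the connection is exhausted
      if (is.null(chunk)) break
      # ... process or aggregate the chunk here instead of keeping everything in memory ...
      if (nrow(chunk) < chunk_size) break         # last, partial chunk
    }
    close(con)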

  • 2020-11-21 05:30

    I am reading data very quickly using the new arrow package. It appears to be at a fairly early stage.

    Specifically, I am using the Parquet columnar format. This converts back to a data.frame in R, but you can get even deeper speedups if you skip that conversion. The format is also convenient because it can be read from Python as well.

    My main use case for this is on a fairly resource-constrained RShiny server. For that reason, I prefer to keep the data attached to the apps (i.e., out of SQL), and therefore need small file sizes as well as speed.

    This linked article provides benchmarking and a good overview. I have quoted some interesting points below.

    https://ursalabs.org/blog/2019-10-columnar-perf/

    File Size

    That is, the Parquet file is half as big as even the gzipped CSV. One of the reasons that the Parquet file is so small is because of dictionary-encoding (also called “dictionary compression”). Dictionary compression can yield substantially better compression than using a general purpose bytes compressor like LZ4 or ZSTD (which are used in the FST format). Parquet was designed to produce very small files that are fast to read.

    Read Speed

    When controlling by output type (e.g. comparing all R data.frame outputs with each other) we see that the performance of Parquet, Feather, and FST falls within a relatively small margin of each other. The same is true of the pandas.DataFrame outputs. data.table::fread is impressively competitive with the 1.5 GB file size but lags the others on the 2.5 GB CSV.


    Independent Test

    I performed some independent benchmarking on a simulated dataset of 1,000,000 rows. Basically, I shuffled a bunch of things around to try to challenge the compression. I also added a short text field of random words and two simulated factors.

    Data

    library(dplyr)
    library(tibble)
    library(OpenRepGrid)   # provides randomSentences()
    
    n <- 1000000
    
    set.seed(1234)
    # two pools of random factor levels (10 short codes, 65 longer codes)
    some_levels1 <- sapply(1:10, function(x)
      paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
    some_levels2 <- sapply(1:65, function(x)
      paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))
    
    # resample mtcars up to n rows, shuffle every column, then add the
    # simulated factor columns and a random text field
    test_data <- mtcars %>%
      rownames_to_column() %>%
      sample_n(n, replace = TRUE) %>%
      mutate_all(~ sample(., length(.))) %>%
      mutate(factor1 = sample(some_levels1, n, replace = TRUE),
             factor2 = sample(some_levels2, n, replace = TRUE),
             text = randomSentences(n, sample(3:8, n, replace = TRUE))
             )
    
    

    Read and Write

    Writing the data is easy.

    library(arrow)
    
    write_parquet(test_data , "test_data.parquet")
    
    # you can also mess with the compression
    write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)
    

    Reading the data is also easy.

    read_parquet("test_data.parquet")
    
    # this option will result in lightning fast reads, but in a different format.
    read_parquet("test_data2.parquet", as_data_frame = FALSE)
    

    I tested reading this data against a few of the competing options and got slightly different results than the article above, which is expected.

    This file is nowhere near as large as the ones in the benchmark article, so maybe that is the difference.

    Tests

    • rds: test_data.rds (20.3 MB)
    • parquet2_native: (14.9 MB with higher compression and as_data_frame = FALSE)
    • parquet2: test_data2.parquet (14.9 MB with higher compression)
    • parquet: test_data.parquet (40.7 MB)
    • fst2: test_data2.fst (27.9 MB with higher compression)
    • fst: test_data.fst (76.8 MB)
    • fread2: test_data.csv.gz (23.6 MB)
    • fread: test_data.csv (98.7 MB)
    • feather_arrow: test_data.feather (157.2 MB read with arrow)
    • feather: test_data.feather (157.2 MB read with feather)

    Observations

    For this particular file, fread is actually very fast. I like the small file size from the highly compressed parquet2 test. I may invest the time to work with the native data format rather than a data.frame if I really need the speed-up.

    fst is also a great choice here. I would use either the highly compressed fst format or the highly compressed parquet, depending on whether I needed speed or a smaller file size. A sketch of the fst calls follows.
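
    For completeness, a minimal sketch of the fst calls behind the fst and fst2 entries above (treating compress = 100 as the "higher compression" setting is my assumption):

    library(fst)
    write_fst(test_data, "test_data.fst")                    # default compression ("fst" above)
    write_fst(test_data, "test_data2.fst", compress = 100)   # maximum compression ("fst2" above)
    system.time(FST <- read_fst("test_data2.fst"))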

  • 2020-11-21 05:30

    I've tried all of the above and readr did the best job. I have only 8 GB of RAM.

    Looping over 20 files, 5 GB each, 7 columns (a fuller sketch of the loop follows the call below):

    library(readr)
    read_fwf(arquivos[i], col_types = "ccccccc",
             fwf_cols(cnpj = c(4, 17), nome = c(19, 168), cpf = c(169, 183),
                      fantasia = c(169, 223), sit.cadastral = c(224, 225),
                      dt.sitcadastral = c(226, 233), cnae = c(376, 382)))
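
    A hedged sketch of how that loop can process the files one at a time, so that 20 × 5 GB never has to sit in memory at once (arquivos is assumed to be the character vector of file paths, and the processing step is a placeholder):

    library(readr)
    for (i in seq_along(arquivos)) {
      dados <- read_fwf(arquivos[i], col_types = "ccccccc",
                        fwf_cols(cnpj = c(4, 17), nome = c(19, 168), cpf = c(169, 183),
                                 fantasia = c(169, 223), sit.cadastral = c(224, 225),
                                 dt.sitcadastral = c(226, 233), cnae = c(376, 382)))
      # filter/summarise dados and write the reduced result out before reading the next file
      rm(dados); gc()
    }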
    
  • 2020-11-21 05:31

    I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.

    There's been a little bit of discussion about the best way to import 2 GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then pulling it from SQLite into R. This works really well for me. I was able to pull in 2 GB (3 columns, 40 million rows) of data in under 5 minutes. By contrast, the read.csv command ran all night and never completed.

    Here's my test code:

    Set up the test data:

    bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
    write.csv(bigdf, 'bigdf.csv', quote = F)
    

    I restarted R before running the following import routine:

    library(sqldf)
    f <- file("bigdf.csv")
    system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
    

    I let the following line run all night but it never completed:

    system.time(big.df <- read.csv('bigdf.csv'))
    