Quickly reading very large tables as dataframes

清歌不尽 2020-11-21 04:46

I have very large tables (30 million rows) that I would like to load as dataframes in R. read.table() has a lot of convenient features, but it seems like the

11 Answers
  •  予麋鹿 · 2020-11-21 05:31

    I didn't see this question initially and asked a similar question a few days later. I am going to take my previous question down, but I thought I'd add an answer here to explain how I used sqldf() to do this.

    There's been a little bit of discussion as to the best way to import 2GB or more of text data into an R data frame. Yesterday I wrote a blog post about using sqldf() to import the data into SQLite as a staging area, and then sucking it from SQLite into R. This works really well for me. I was able to pull in 2GB (3 columns, 40 million rows) of data in under 5 minutes. By contrast, the read.csv command ran all night and never completed.

    Here's my test code:

    Set up the test data:

    # 40 million rows: one letter column plus two numeric columns (~2GB on disk)
    bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
    write.csv(bigdf, 'bigdf.csv', quote = F)
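
    A quick, optional sanity check (not part of the timing test): file.info() reports the size of the written file in bytes, which should come out around 2 GB for 40 million rows of three columns.

    # size of the generated CSV in GiB (bytes / 1024^3)
    file.info("bigdf.csv")$size / 1024^3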
    

    I restarted R before running the following import routine:

    library(sqldf)
    f <- file("bigdf.csv")
    # sqldf() stages the file in a temporary SQLite database (dbname = tempfile()),
    # runs the select there, and returns the result as an R data frame
    system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))
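
    The sqldf package also ships a convenience wrapper, read.csv.sql(), that does the same SQLite staging without an explicit file() connection. A minimal, untimed sketch of the equivalent call (assuming the same bigdf.csv):

    # read.csv.sql() stages the CSV in a temporary SQLite database, runs the
    # query there, and returns an ordinary data frame; the table is referred
    # to as "file" in the SQL
    bigdf <- read.csv.sql("bigdf.csv", sql = "select * from file", dbname = tempfile())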
    

    I let the following line run all night but it never completed:

    system.time(big.df <- read.csv('bigdf.csv'))
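
    If you want to reproduce the comparison without leaving read.csv running all night, one rough shortcut (not what was actually run here) is to time it on a slice of the file via nrows and extrapolate:

    # time read.csv on the first million rows only; scaling is roughly
    # linear, so multiply by ~40 to estimate the full 40-million-row file
    system.time(small.df <- read.csv("bigdf.csv", nrows = 1e6))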
    
