Question
I have been reading around on this website and haven't been able to find the exact answer. If it already exists, I apologize for the repost.
I am working with extremely large data sets (600 million rows, 64 columns) on a computer with 32 GB of RAM. I really only need much smaller subsets of this data, but I am struggling to perform any operation beyond importing one data set with fread and selecting the 5 columns I need. After that, when I try to overwrite my data set with the specific conditions I need, I hit my RAM cap and get the message "Error: cannot allocate vector of size 4.5 GB". I looked at the ff and bigmemory packages as alternatives, but it seems you can't subset before importing in those packages? Is there any solution to this problem besides upgrading the RAM in my computer?
Tasks I am trying to perform:
> SampleTable <- fread("my.csv", header = TRUE, sep = ",", select = c("column1", "column2", "column7", "column12", "column15"))
> SampleTable2 <- SampleTable[SampleTable[, column1 == "6" & column7 == "1"]]
At this point, I hit my memory cap. Would it be better to use another package, even if it means importing all 64 columns of the 600 million rows? I also don't want to spend hours upon hours just to perform one import.
Answer 1:
What you could do is read the CSV file in chunks:
# Define colClasses so that read.csv skips (NULL) every column
# except the five we want, which are read as doubles
csv <- "my.csv"
colnames <- names(read.csv(csv, header = TRUE, nrows = 1))
colclasses <- rep(list(NULL), length(colnames))
ind <- c(1, 2, 7, 12, 15)
colclasses[ind] <- "double"

# Read the header plus the first data row; rename the kept columns to
# V1, V2, V7, V12, V15 *before* filtering, so that the names match the
# V-names read.csv auto-assigns to the header-less chunks below
library(dplyr)
l_df <- list()
con <- file(csv, "rt")
df <- read.csv(con, header = TRUE, nrows = 1, colClasses = colclasses)
names(df) <- paste0("V", ind)
l_df[[i <- 1]] <- filter(df, V1 == 6, V7 == 1)

# Read all remaining lines in chunks, keeping only matching rows,
# then combine the pieces
repeat {
  i <- i + 1
  df <- read.csv(con, header = FALSE, nrows = 9973, colClasses = colclasses)
  l_df[[i]] <- filter(df, V1 == 6, V7 == 1)
  if (nrow(df) < 9973) break
}
close(con)
df <- do.call("rbind", l_df)
9973 is an arbitrary prime number, chosen because a prime is unlikely to divide the number of remaining data rows exactly; if the last chunk contained exactly 9973 rows, the loop would call read.csv once more on an exhausted connection and fail.
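The same chunk-and-filter idea is also packaged by readr's read_csv_chunked, which spares you the connection bookkeeping. A minimal sketch, under the assumption that the header names really are column1, column7, etc. as in the question (not tested against the actual file):

library(readr)
library(dplyr)

# Read only the five needed columns; the other 59 are never materialized
spec <- cols_only(column1 = "d", column2 = "d", column7 = "d",
                  column12 = "d", column15 = "d")

# The callback filters each chunk; DataFrameCallback row-binds the results
keep_rows <- DataFrameCallback$new(function(chunk, pos) {
  filter(chunk, column1 == 6, column7 == 1)
})

df <- read_csv_chunked("my.csv", keep_rows, chunk_size = 10000,
                       col_types = spec)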
Answer 2:
If your dataset can be easily parsed (no embedded commas, for example):
library(data.table)
> fread('cat tmp.csv')
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 col15 col16 col17
1: 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
2: 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
> fread("cat tmp.csv | awk -F ',' 'NR == 1 || ($1 == 6 && $7 == 1)'")
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 col14 col15 col16 col17
1: 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
> fread("cat tmp.csv | awk -F ',' 'NR == 1 || ($1 == 6 && $7 == 1) {print $1, $2, $7, $12, $15}'")
col1 col2 col7 col12 col15
1: 6 1 1 1 1
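As a side note (not part of the original answer): recent versions of data.table prefer the cmd argument for shell input, and setting awk's output field separator keeps the intermediate comma-separated, e.g.

> fread(cmd = "awk -F ',' -v OFS=',' 'NR == 1 || ($1 == 6 && $7 == 1) {print $1, $2, $7, $12, $15}' tmp.csv")

Because awk streams the file, R only ever sees the matching rows, which is the point for a 600-million-row input.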
Source: https://stackoverflow.com/questions/45362126/still-struggling-with-handling-large-data-set