Question
I'm currently trying to open a 48 GB CSV on my computer. Needless to say, my RAM does not support such a huge file, so I'm trying to filter it before opening it. From what I've researched, the most appropriate way to do so in R is with the sqldf library, more specifically its read.csv.sql function:
df <- read.csv.sql('CIF_FOB_ITIC-en.csv', sql = "SELECT * FROM file WHERE 'Year' IN (2014, 2015, 2016, 2017, 2018)")
However, I got the following message:
Erro: duplicate column name: Measure
Since SQL is case-insensitive, having two variables, one named Measure and the other MEASURE, results in duplicate column names. To get around this, I tried using the header = FALSE argument and replaced 'Year' with V9, which yielded the following error instead:
Error in connection_import_file(conn@ptr, name, value, sep, eol, skip) : RS_sqlite_import: CIF_FOB_ITIC-en.csv line 2 expected 19 columns of data but found 24
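For reference, that second attempt was roughly the following call (reconstructed; header is read.csv.sql's own argument, and V9 stands for the ninth column once the header row is no longer used for names):

df <- read.csv.sql(
  'CIF_FOB_ITIC-en.csv',
  sql = "SELECT * FROM file WHERE V9 IN (2014, 2015, 2016, 2017, 2018)",
  header = FALSE
)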
How should I proceed in this case?
Thanks in advance!
Answer 1:
Here's a Tidyverse solution that reads the CSV in chunks, filters each chunk, and stacks up the resulting rows. It also processes the chunks in parallel, so the whole file still gets scanned, but far more quickly (depending on your core count) than if the chunks were handled one at a time, as with apply (or purrr::map, for that matter).
Comments inline.
library(tidyverse)
library(furrr)
# Make a CSV file out of the NASA stock dataset for demo purposes
raw_data_path <- tempfile(fileext = ".csv")
nasa %>% as_tibble() %>% write_csv(raw_data_path)
# Get the row count of the raw data, incl. header row, without loading the
# actual data
raw_data_nrow <- length(count.fields(raw_data_path))
# Hard-code the largest batch size you can, given your RAM in relation to the
# data size per row
batch_size <- 1e3
# Set up parallel processing of multiple chunks at a time, leaving one virtual
# core free, as usual
plan(multiprocess, workers = availableCores() - 1)
filtered_data <-
  # Define the sequence of start-point row numbers for each chunk (each number
  # is actually the start point minus 1 since we're using the seq. no. as the
  # no. of rows to skip)
  seq(from = 0,
      # Add the batch size to ensure that the last chunk is large enough to grab
      # all the remainder rows
      to = raw_data_nrow + batch_size,
      by = batch_size) %>%
  future_map_dfr(
    ~ read_csv(
      raw_data_path,
      skip = .x,
      n_max = batch_size,
      # Can't read in col. names in each chunk since they're only present in the
      # 1st chunk
      col_names = FALSE,
      # This reads in each column as character, which is safest but slowest and
      # most memory-intensive. If you're sure that each batch will contain
      # enough values in each column so that the type detection in each batch
      # will come to the same conclusions, then comment this out and leave just
      # the guess_max
      col_types = cols(.default = "c"),
      guess_max = batch_size
    ) %>%
      # This is where you'd insert your filter condition(s)
      filter(TRUE),
    # Progress bar! So you know how many chunks you have left to go
    .progress = TRUE
  ) %>%
  # The first row will be the header values, so set the column names to equal
  # that first row, and then drop it
  set_names(slice(., 1)) %>%
  slice(-1)
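To apply the year filter from the question, the filter(TRUE) placeholder above would be replaced with something along these lines. This is only a sketch: it assumes Year is the ninth column (the question's V9), and since col_names = FALSE makes readr name the columns X1, X2, ... and everything is read as character, the years are compared as strings. The header row also has to be kept so the final set_names()/slice() step still works:

# Hypothetical filter in place of filter(TRUE): keep the header row (whose
# ninth value is "Year") plus the rows for the requested years, compared as
# strings since all columns were read in as character
filter(X9 == "Year" | X9 %in% c("2014", "2015", "2016", "2017", "2018"))

Because every column comes back as character, running the combined result through readr::type_convert() afterwards should restore the numeric columns.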
Source: https://stackoverflow.com/questions/59482431/how-to-filter-a-very-large-csv-in-r-prior-to-opening-it