Clean up code and keep null values from crashing read.csv.sql

Question

I am using read.csv.sql to read in data conditionally. My data set is extremely large, and reading it in full and then filtering caused memory problems, so I need the filter applied during the read so that only the subset is loaded rather than the full data set.

Here is a small data set so my problem can be reproduced:

write.csv(iris, "iris.csv", row.names = FALSE)
library(sqldf)
csvFile <- "iris.csv"

I find the notation that read.csv.sql requires extremely awkward. This is how I am reading in the file:

# Step 1 (Assume these values are coming from UI)
spec <- 'setosa'
petwd <- 0.2

# Add quotes and make comma-separated:
spec <- toString(sprintf("'%s'", spec)) 
petwd <- toString(sprintf("'%s'", petwd)) 

# Step 2 - Conditionally read in the data, store in 'd'
d <- fn$read.csv.sql(csvFile, sql='select * from file where 
                                  "Species" in ($spec)
                                  and "Petal.Width" in ($petwd)',
                     filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))

My main problem is that if any of the values above (from the UI) are null, the data is not read in properly, because the query is hard-coded.
I would like to change this as follows: first check which values are null and leave them out of the filter, then use read.csv.sql to filter on the corresponding columns for all non-null values.

Note: I am reusing the code from this similar question within this question.

UPDATE
I want to clear up what I am asking. This is what I am trying to do:

If a field, say spec, comes through as NA (meaning the user did not pick a value), then I want the filter on that field to be dropped, so that it defaults to every species:

# Step 2 - Conditionally read in the data, store in 'd'
d <- fn$read.csv.sql(csvFile, sql='select * from file where 
                                  "Petal.Width" in ($petwd)',
                     filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))

Since spec is NA, filtering on spec == NA returns an empty data set, because there are no NA values in my data. That breaks the code and the program. I hope this makes the question clearer.

Answer 1:

There are several problems:

Some of the simplifications provided in the link in the question were not followed.
spec is a scalar, so one can just use '$spec'.
petwd is a numeric scalar and SQL does not require quotes around numbers, so just use $petwd.
The question states that you want to handle empty fields but not how, so we have used csvfix to map them to -1 and also to strip off quotes. (Alternatively, let them through and handle them in R. Empty numerics will come through as 0 and empty character fields will come through as zero-length character strings.)
You can use [...] in place of "..." in SQL.

The code below worked for me in both Windows and Ubuntu Linux with the bash shell.

library(sqldf)

spec <- 'setosa'
petwd <- 0.2

d <- fn$read.csv.sql(
  "iris.csv", 
  sql = "select * from file where [Species] = '$spec' and [Petal.Width] = $petwd", 
  verbose = TRUE, 
  filter = 'csvfix map -smq -fv "" -tv -1'
)
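
To check the query before running it against a large file, it can help to look at the SQL text after substitution. The lines below are only a sketch in base R: they use sprintf as a stand-in to build the same string that the $spec and $petwd placeholders are expected to produce, and they do not call read.csv.sql or fn$ at all.

```r
# Preview of the SQL text, built with base R's sprintf instead of fn$.
# This is a stand-in for inspection only; it does not read any file.
spec <- "setosa"
petwd <- 0.2

sql_preview <- sprintf(
  "select * from file where [Species] = '%s' and [Petal.Width] = %s",
  spec, petwd
)

# The character value is quoted, the numeric value is not.
print(sql_preview)
```

The verbose = TRUE argument in the call above serves a similar purpose at run time.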

Update

The update at the end of the question clarifies that the NA may be in spec, not in the data being read, and that when spec is NA the condition involving spec should be treated as TRUE. In that case, expand the SQL where condition as follows. When spec is NA, '$spec' is substituted as 'NA', so the first comparison is true and the Species test is effectively skipped.

spec <- NA
petwd <- 0.2

d <- fn$read.csv.sql(
  "iris.csv", 
  sql = "select * from file 
         where ('$spec' == 'NA' or [Species] = '$spec') and [Petal.Width] = $petwd", 
  verbose = TRUE, 
  filter = 'csvfix echo -smq'
)

The above will return all rows for which Petal.Width is 0.2.
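
If there are many optional inputs, another option is to assemble the where clause in R and include only the fields that were actually supplied. The following is a minimal sketch in base R, not part of sqldf: build_where is a hypothetical helper name, and it assumes the filters arrive as a named list whose names match the column names. The resulting string could then be passed as the sql argument of read.csv.sql.

```r
# Hypothetical helper (not part of sqldf): build a WHERE clause from a named
# list of filter values, skipping entries that are NULL, empty, or all NA.
build_where <- function(filters) {
  is_missing <- vapply(
    filters,
    function(v) is.null(v) || all(is.na(v)),
    logical(1)
  )
  filters <- filters[!is_missing]
  if (length(filters) == 0) return("")

  clauses <- mapply(function(col, v) {
    v <- v[!is.na(v)]
    vals <- if (is.numeric(v)) {
      as.character(v)  # numbers need no quotes in SQL
    } else {
      # double any embedded single quote, then wrap in single quotes
      sprintf("'%s'", gsub("'", "''", as.character(v), fixed = TRUE))
    }
    sprintf("[%s] in (%s)", col, paste(vals, collapse = ", "))
  }, names(filters), filters)

  paste("where", paste(clauses, collapse = " and "))
}

# spec was not chosen in the UI, so only Petal.Width is filtered.
spec <- NA
petwd <- 0.2
sql <- trimws(paste(
  "select * from file",
  build_where(list(Species = spec, Petal.Width = petwd))
))
print(sql)
```

Doubling single quotes guards against values that contain a quote character, which matters when the values come from user input.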

Source: https://stackoverflow.com/questions/52882551/clean-up-code-and-keep-null-values-from-crashing-read-csv-sql

Tags

sql

conditional

read.csv