Notation issues with read.csv.sql in R

没有蜡笔的小新 2021-01-25 01:02

I am using read.csv.sql to conditionally read in data (my data set is extremely large so this was the solution I chose to filter it and reduce it in size <

1 Answer
  • 2021-01-25 01:25

    The problem is that sqldf provides text preprocessing facilities, but the code shown in the question does not use them, which makes it overly complex.

    1) Regarding text substitution, use fn$ (from gsubfn, which sqldf loads automatically), as discussed on the GitHub page for sqldf. The following assumes that quote = FALSE was used in write.csv, since SQLite does not handle quotes natively:

    spec <- 'setosa'
    out <- fn$read.csv.sql("iris.csv", "select * from file where Species = '$spec' ")
    
    spec <- c("setosa", "versicolor")
    string <- toString(sprintf("'%s'", spec)) # add quotes and make comma-separated
    out <- fn$read.csv.sql("iris.csv", "select * from file where Species in ($string) ")
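
    The snippets above assume an unquoted iris.csv. A minimal base R sketch of how such a file could be produced, and of the SQL text that the $string substitution is expected to yield. The temporary path is an assumption for illustration, and sprintf stands in for fn$ here so that the example does not need sqldf:

    ```r
    # Write iris without double quotes, as the answer assumes
    csv <- file.path(tempdir(), "iris.csv")
    write.csv(iris, csv, quote = FALSE, row.names = FALSE)
    first_lines <- readLines(csv, n = 2)
    print(first_lines)

    # Build the IN list exactly as above and show the resulting SQL text
    spec <- c("setosa", "versicolor")
    string <- toString(sprintf("'%s'", spec))  # quote each value, join with ", "
    sql <- sprintf("select * from file where Species in (%s)", string)
    print(sql)
    ```

    Printing the SQL text before running it is a cheap way to confirm that every value is quoted and comma-separated.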
    

    2) Regarding deleting double quotes, a simpler way would be to use the following filter= argument:

    read.csv.sql("iris.csv", filter = "tr -d \\042") # Windows
    

    or

    read.csv.sql("iris.csv", filter = "tr -d \\\\042") # Linux / bash
    

    depending on your shell. The first one worked for me on Windows (with Rtools installed and on the PATH) and the second worked for me on Linux with bash. It is possible that other variations could be needed for other shells.
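
    The two variants differ only in how many layers of escaping the backslash has to survive. A small base R sketch of what R actually hands to the shell in each case. The variable names are hypothetical, and the comment about bash reflects its usual handling of an unquoted backslash, so treat it as an assumption for other shells:

    ```r
    win_filter  <- "tr -d \\042"    # R string contains: tr -d \042
    bash_filter <- "tr -d \\\\042"  # R string contains: tr -d \\042

    # writeLines shows the strings as the shell will receive them
    writeLines(c(win_filter, bash_filter))

    # bash typically collapses the unquoted \\ to a single \, so tr sees \042
    # in both cases; octal 042 is the character code of the double quote
    quote_code <- strtoi("042", base = 8L)
    print(quote_code == utf8ToInt("\""))
    ```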

    2a) Another possibility for removing quotes is to install the free csvfix utility (available on Windows, Linux and Mac) and then use the following filter= argument. It does not involve any characters that are typically interpreted specially by either R or most shells, so it should work in all shells and on all platforms.

    read.csv.sql("iris.csv", filter = "csvfix echo -smq")
    

    2b) Another cross-platform utility that could be used is xsv. The eol= argument is only needed on Windows, since xsv produces UNIX-style line endings, but it won't hurt on other platforms, so the following line should work on all platforms.

    read.csv.sql("iris.csv", eol = "\n", filter = "xsv fmt")
    

    2c) sqldf also includes an awk program (csv.awk) that can be used. It outputs UNIX style newlines so specify eol = "\n" on Windows. On other platforms it won't hurt if you specify it but you can omit it if you wish since that is the default on those platforms.

    csv.awk <- system.file("csv.awk", package = "sqldf")
    rm_quotes_cmd <- sprintf('gawk -f "%s"', csv.awk)
    read.csv.sql("iris.csv", eol = "\n", filter = rm_quotes_cmd)
    

    3) Regarding general tips, note that the verbose = TRUE argument to read.csv.sql can be useful for seeing what is going on.

    read.csv.sql("iris.csv", verbose = TRUE)
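
    Putting the pieces together, a hedged end-to-end sketch, assuming sqldf is installed. The temporary file path and the use of fn$identity to expand $spec before the call are illustrative choices, not part of the answer above:

    ```r
    if (requireNamespace("sqldf", quietly = TRUE)) {
      library(sqldf)  # also attaches gsubfn, which provides fn$

      # Unquoted CSV in a temporary location (the path is an assumption)
      csv <- file.path(tempdir(), "iris.csv")
      write.csv(iris, csv, quote = FALSE, row.names = FALSE)

      # Expand $spec into the SQL text, then filter while reading
      spec <- "setosa"
      sql <- fn$identity("select * from file where Species = '$spec'")
      out <- read.csv.sql(csv, sql = sql, verbose = TRUE)
      print(table(out$Species))
    }
    ```

    Expanding the SQL separately keeps the file path out of the fn$ substitution, which may matter if the path ever contains a $ character.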
    