Question
It took me a while to figure this out, so I am answering my own question.
You have some .csv file, you want to load it fast, and you want to use the sqldf
package. Your usual code gets tripped up by a few annoying fields. Example:
1001, Amy,9:43:00, 99.2
1002,"Ben,Jr",9:43:00, 99.2
1003,"Ben,Sr",9:44:00, 99.3
This code only works on *nix systems.
library(sqldf)
system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")
If you try to read it with
x <- read.csv.sql("temp.csv", header=FALSE)
R complains
Error in try({ :
RS-DBI driver: (RS_sqlite_import: ./temp.csv line 2 expected 4 columns of data but found 5)
The sqldf-FAQ.13 solution doesn't work either:
x <- read.csv.sql("temp.csv", filter = "tr -d '\"' ", header=FALSE)
Again, R complains
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 5 elements
In fact, the filter only removes double quotes.
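You can check this by running the filter by hand (a quick test, assuming a *nix shell and the temp.csv created above):
system("tr -d '\"' < temp.csv")
# 1001, Amy,9:43:00, 99.2
# 1002,Ben,Jr,9:43:00, 99.2
# 1003,Ben,Sr,9:44:00, 99.3
The quotes are gone, but line 2 now has five comma-separated fields.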
So, how to proceed?
Answer 1:
Perl and regexes to the rescue. Digging through SO and toying with regexes, it is not too hard to come up with the right one:
s/(\"[^\",]+),([^\"]+\")/$1_$2/g
which matches "...,..." (here the dots are anything but double quotes and commas) and substitutes the comma with an underscore.
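Before wiring it into sqldf, you can sanity-check the pattern from R (a quick test, assuming perl is on the PATH; -p makes perl print each line after the substitution):
system("echo '1002,\"Ben,Jr\",9:43:00, 99.2' | perl -p -e 's/(\"[^\",]+),([^\"]+\")/$1_$2/g'")
# 1002,"Ben_Jr",9:43:00, 99.2
This Perl one-liner is the right filter to pass to sqldf: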
x <- read.csv.sql("temp.csv",
filter = "perl -p -e 's/(\"[^\",]+),([^\"]+\")/$1_$2/g'",
header=FALSE)
Here is the data frame x:
> x
V1 V2 V3 V4
1 1001 Amy 9:43:00 99.2
2 1002 "Ben_Jr" 9:43:00 99.2
3 1003 "Ben_Sr" 9:44:00 99.3
Now, do your own cosmetic clean-up on the strings ...
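For instance, a minimal sketch of that clean-up (assuming the fields contained no genuine underscores):
x$V2 <- gsub('"', '', x$V2)  # drop the leftover double quotes
x$V2 <- gsub('_', ',', x$V2) # turn the placeholder underscores back into commas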
EDIT: The regex above only replaces the first occurrence of a comma in the field. To replace all the occurrences, use this:
s{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg
What's different?
- I replaced the delimiters / with {};
- the option e at the very end instructs the parser to interpret the replacement field as Perl code;
- the replacement is a simple regex replace that substitutes all "," with "_" within the matched substring $&.
An example:
system("touch temp.csv")
system("echo '1001, Amy,9:43:00, 99.2\n1002,\"Ben,Jr,More,Commas\",9:43:00, 99.2\n1003,\"Ben,Sr\",9:44:00, 99.3' > temp.csv")
The file temp.csv looks like:
1001, Amy,9:43:00, 99.2
1002,"Ben,Jr,More,Commas",9:43:00, 99.2
1003,"Ben,Sr",9:44:00, 99.3
And can be read with
x <- read.csv.sql("temp.csv",
filter = "perl -p -e 's{(\"[^\",]+),([^\"]+\")}{$_= $&, s/,/_/g, $_}eg'",
header=FALSE)
> x
V1 V2 V3 V4
1 1001 Amy 9:43:00 99.2
2 1002 "Ben_Jr_More_Commas" 9:43:00 99.2
3 1003 "Ben_Sr" 9:44:00 99.3
Answer 2:
For Windows, sqldf now comes with trcomma2dot.vbs, which does this by default with read.csv2.sql, although I found it slow for very large data (>1 million rows).
The documentation mentions "tr" for non-Windows systems, but I could not try it.
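A minimal sketch of that Windows route (untested here; read.csv2.sql is sqldf's variant for semicolon-separated files with decimal commas, and "myfile.csv" is a hypothetical placeholder):
library(sqldf)
x <- read.csv2.sql("myfile.csv", header = FALSE)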
Source: https://stackoverflow.com/questions/8086559/sqldf-csv-and-fields-containing-commas