I got a CSV dump from SQL Server 2008 that has lines like this:
Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,1971
If your CSV never uses a double quote as a legitimate quoting character, pass :quote_char => "\0" in the CSV options, so that a character that never appears in the data (NUL) is treated as the quote character.
Then you can do this (strings wrapped for clarity):
1.9.3p327 > puts 'Construction,197133031B,"MORGAN SHOES" ALT,
1997-05-13 00:00:00'.parse_csv(:quote_char => "\0")
Construction
197133031B
"MORGAN SHOES" ALT
1997-05-13 00:00:00
1.9.3p327 > puts 'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,
1996-08-09 00:00:00'.parse_csv(:quote_char => "\0")
Plumbing
196222006P
REPLACE LEAD WATER SERVICE W/1" COPPER
1996-08-09 00:00:00
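The same option can be handed to CSV.parse to process several lines at once. This is only a sketch, assuming the dump really never uses " as a quoting character; the heredoc below stands in for the real file contents.

```ruby
require 'csv'

# Assumption: '"' never acts as a real quoting character in this dump, so NUL
# (which should never occur in the data) stands in as the quote character.
dump = <<'EOS'
Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
EOS

rows = CSV.parse(dump, :quote_char => "\0")
rows.each { |row| puts row.inspect }

# Caveat: with quoting disabled, a quoted field that contains a comma is
# split at that comma, so this line yields 5 fields instead of 4.
split = 'Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00'.parse_csv(:quote_char => "\0")
puts split.length
```

If the dump does contain fields like "SERVICE, OUTLETS", this approach is probably not the right fit, because the quotes no longer protect the embedded comma.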
The following uses a regexp and String#scan. I observe that in the broken CSV format you're dealing with, " only has quoting properties when it comes at the beginning and end of a field.
scan moves through the string, successively matching the regexp, so the regexp can assume that each match starts at the beginning of a field. We construct the regexp so it can match either a balanced quoted field with no internal quotes (QUOTED) or a string of non-commas (UNQUOTED). Whichever alternative matches, it must be followed by a separator, which is either a comma or the end of the string (SEP).
Because UNQUOTED can match a zero-length field before a separator, the scan always matches an empty field at the end, which we discard with [0...-1]. scan produces an array of tuples; each tuple is an array of the capture groups, so we map over each element, picking the captured alternative with matches[0] || matches[1].
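To make the tuple shape concrete, here is a small sketch of what scan returns before the map step. FIELD_SKETCH is a name made up for this illustration; it is the same pattern written inline so the snippet runs on its own, and the input strings are invented.

```ruby
# Same pattern as QUOTED | UNQUOTED followed by SEP, written inline.
FIELD_SKETCH = /(?:"([^"]*)"|([^,]*))(?:,|\Z)/

raw = 'a,"b, c",d'.scan(FIELD_SKETCH)
p raw
# Each tuple is [quoted_capture, unquoted_capture]; the final tuple is the
# zero-length match at the end of the string.

fields = raw[0...-1].map { |quoted, unquoted| quoted || unquoted }
p fields

# Edge case worth checking against your data: when the last field is itself
# empty, only one empty match is produced, so dropping it loses that field.
trailing = 'a,'.scan(FIELD_SKETCH)
p trailing
```

For the first input, the raw result has four tuples and the last one is [nil, ""], which is the match the [0...-1] slice removes.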
None of your example lines shows a field that contains both a comma and a quote. I have no idea how that would be legally represented, and this code probably won't recognize such a field correctly.
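As an illustration of that limitation, here is a sketch with a made-up line whose third field is meant to contain both a quote and a comma. The input, FIELD_DEMO and parse_demo are all hypothetical and exist only for this example; the pattern is the same one described above, written inline.

```ruby
# Hypothetical input: the third field is meant to be `1" PIPE, COPPER`,
# wrapped in quotes but with no escaping of the inner quote.
FIELD_DEMO = /(?:"([^"]*)"|([^,]*))(?:,|\Z)/

def parse_demo(line)
  line.scan(FIELD_DEMO)[0...-1].map { |quoted, unquoted| quoted || unquoted }
end

broken = parse_demo('Plumbing,196222006P,"1" PIPE, COPPER",1996-08-09 00:00:00')
p broken
puts broken.length
```

The quoted alternative matches "1" but is not followed by a separator, so the regexp falls back to the unquoted alternative and splits at the embedded comma, giving five fields rather than four.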
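SEP = /(?:,|\Z)/          # comma or end of string
QUOTED = /"([^"]*)"/      # balanced quotes, no internal quotes
UNQUOTED = /([^,]*)/      # any run of non-commas, possibly empty
FIELD = /(?:#{QUOTED}|#{UNQUOTED})#{SEP}/
def ugly_parse(line)
  # Drop the trailing empty match, then keep whichever alternative captured.
  line.scan(FIELD)[0...-1].map { |matches| matches[0] || matches[1] }
end
lines = [
  'Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00',
  'Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00',
  'Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00'
]
lines.each do |l|
  puts l
  puts ugly_parse(l).inspect
  puts
end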
# Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
# ["Electrical", "197135021E", "SERVICE, OUTLETS", "1997-05-15 00:00:00"]
#
# Plumbing,196222006P,REPLACE LEAD WATER SERVICE W/1" COPPER,1996-08-09 00:00:00
# ["Plumbing", "196222006P", "REPLACE LEAD WATER SERVICE W/1\" COPPER", "1996-08-09 00:00:00"]
#
# Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
# ["Construction", "197133031B", "\"MORGAN SHOES\" ALT", "1997-05-13 00:00:00"]
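When the lines come from a file rather than an array literal, each one usually still carries its trailing newline. A minimal sketch of that case follows, assuming one record per line. StringIO stands in for the real dump here, and FIELD_IO and parse_line_sketch are names invented for this example.

```ruby
require 'stringio'

FIELD_IO = /(?:"([^"]*)"|([^,]*))(?:,|\Z)/

def parse_line_sketch(line)
  line.scan(FIELD_IO)[0...-1].map { |quoted, unquoted| quoted || unquoted }
end

# StringIO stands in for the real file; File.foreach on your dump's path
# would yield lines in the same way.
io = StringIO.new(<<'EOS')
Electrical,197135021E,"SERVICE, OUTLETS",1997-05-15 00:00:00
Construction,197133031B,"MORGAN SHOES" ALT,1997-05-13 00:00:00
EOS

# chomp first: [^,]* also matches "\n", so without it the last field would
# keep its trailing newline.
rows = io.each_line.map { |line| parse_line_sketch(line.chomp) }
rows.each { |row| p row }

unchomped = parse_line_sketch("a,b\n")
p unchomped
```

The last call shows why the chomp matters: the newline ends up inside the final field instead of being treated as the end of the record.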