I have this line as an example from a CSV file:
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Lif
This is not a suitable task for regular expressions. You need a CSV parser, and Ruby has one built in:
http://ruby-doc.org/stdlib/libdoc/csv/rdoc/classes/CSV.html
And an arguably superior third-party library:
http://fastercsv.rubyforge.org/
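For a single line like the one in the question, a minimal sketch with the built-in library could look like this. The `line` variable is just an assumed name holding the example row; `CSV.parse_line` returns one flat array, and unquoted empty fields come back as nil.

```ruby
require 'csv'

# Assumed input: the example row from the question, as one string.
line = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

# parse_line handles the quoted comma and strips the enclosing quotes.
fields = CSV.parse_line(line)
p fields
```

The comma inside the fifth field stays part of that field, which is the part a plain split on commas gets wrong.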
EDIT: I failed to read the Ruby tag. The good news is, the guide will explain the theory behind building this, even if the language specifics aren't right. Sorry.
Here is a fantastic guide to doing this:
http://knab.ws/blog/index.php?/archives/10-CSV-file-parser-and-writer-in-C-Part-2.html
And the CSV writer is here:
http://knab.ws/blog/index.php?/archives/3-CSV-file-parser-and-writer-in-C-Part-1.html
These examples cover the case of a quoted literal in a CSV field (which may or may not contain a comma).
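Since the guide is in C, here is a rough Ruby sketch of the same general idea: walk the line character by character and track whether you are inside quotes. This is not the guide's code; `split_csv_line` is a hypothetical helper, and it assumes fields never contain newlines and that a doubled quote inside a quoted field stands for one literal quote.

```ruby
# Hypothetical helper: split one CSV line into fields, honoring double quotes.
# Assumptions: no embedded newlines; "" inside a quoted field means one quote.
def split_csv_line(line)
  fields = []
  field = "".dup          # unfrozen buffer for the current field
  in_quotes = false
  chars = line.chomp.chars
  i = 0
  while i < chars.length
    c = chars[i]
    if in_quotes
      if c == '"'
        if chars[i + 1] == '"'
          field << '"'    # doubled quote: keep one literal quote
          i += 1
        else
          in_quotes = false  # closing quote
        end
      else
        field << c        # commas are ordinary characters inside quotes
      end
    elsif c == '"'
      in_quotes = true    # opening quote
    elsif c == ','
      fields << field     # field separator outside quotes
      field = "".dup
    else
      field << c
    end
    i += 1
  end
  fields << field         # last field has no trailing comma
  fields
end

p split_csv_line('a,"b, c",,d')
```

In real code the built-in CSV library is still the safer choice; the sketch is only meant to show why a quote-aware scan is needed.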
My preference is @steenstag's solution, but an alternative is to use String#scan with the following regular expression.
r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/
If the variable str holds the string given in the example, then
puts str.scan r
displays
2412
21
"Which of the following is not found in all cells?"
"Curriculum"
"Life and Living Processes, Life Processes"
1
0
"endofline"
See also regex101 which provides a detailed explanation of each token of the regex. (Move your cursor across the regex.)
Ruby's regex engine performs the following operations.
(?<![^,])    : negative lookbehind asserts current location is not preceded by a character other than a comma
(?:          : begin non-capture group
  (?!")      : negative lookahead asserts next char is not a double-quote
  [^,\n]*    : match 0+ chars other than a comma and newline
  (?<!")     : negative lookbehind asserts preceding character is not a double-quote
  |          : or
  "          : match double-quote
  [^"\n]*    : match 0+ chars other than double-quote and newline
  "          : match double-quote
)            : end of non-capture group
(?![^,])     : negative lookahead asserts current location is not followed by a character other than a comma
Note that, for a string containing no newlines, (?<![^,]) is the same as (?<=,|^) and (?![^,]) is the same as (?=,|$).
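The matches keep their enclosing double quotes. If you want bare values, one possible follow-up is to strip a single pair of quotes from each match. This is only a sketch: the `fields` name and the stripping step are my additions, and `str` is assumed to hold the example line with no trailing newline (a trailing newline would stop the last field from matching, so chomp first).

```ruby
# Same regex as above.
r = /(?<![^,])(?:(?!")[^,\n]*(?<!")|"[^"\n]*")(?![^,])/

# Assumed input: the example line, with no trailing newline.
str = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

# Drop one pair of enclosing double quotes from each match, if present.
fields = str.scan(r).map { |f| f.sub(/\A"(.*)"\z/, '\1') }
p fields
```

Empty fields come back as empty strings here, not nil.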
This morning I stumbled across a CSV Table Importer project for Ruby on Rails. You may find the code helpful:
GitHub TableImporter
str=<<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF
require 'csv' # built in
p CSV.parse(str)
# That's it! However, empty fields appear as nil.
# Makes sense to me, but if you insist on empty strings then do something like:
parser = CSV.new(str)
parser.convert{|field| field.nil? ? "" : field}
p parser.readlines
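Whether converters are applied to nil fields seems to depend on the version of the csv library, so if the converter does not give you empty strings, a version-independent alternative is to map over the parsed rows yourself. A small sketch (the `rows` name is mine):

```ruby
require 'csv'

str = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"' + "\n"

# nil.to_s is "", and strings are returned unchanged by to_s.
rows = CSV.parse(str).map { |row| row.map { |field| field.to_s } }
p rows
```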
text=<<EOF
2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"
EOF
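x = []
# "\042" is the octal escape for a double quote.
# After splitting on quotes, even-indexed chunks lie outside quotes and
# are split on commas; odd-indexed chunks were quoted and are kept whole.
text.chomp.split("\042").each_with_index do |y, i|
  i % 2 == 0 ? x << y.split(",") : x << y
end
print x.flatten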
Output:
$ ruby test.rb
["2412", "21", "Which of the following is not found in all cells?", "Curriculum", "Life and Living Processes, Life Processes", "", "", "", "1", "0", "endofline"]
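One thing worth checking with the split-on-quotes approach is how it counts empty fields. The sketch below wraps it in a hypothetical helper, `split_on_quotes`, and compares the number of fields against the built-in parser for the same line. On this input the run of three commas contributes three empty strings, while the CSV row has only two empty fields there.

```ruby
require 'csv'

line = '2412,21,"Which of the following is not found in all cells?","Curriculum","Life and Living Processes, Life Processes",,,1,0,"endofline"'

# Hypothetical wrapper around the split-on-quotes approach shown above.
def split_on_quotes(text)
  parts = []
  text.chomp.split('"').each_with_index do |chunk, i|
    parts << (i.even? ? chunk.split(",") : chunk)
  end
  parts.flatten
end

by_split = split_on_quotes(line)
by_csv   = CSV.parse_line(line)

puts "split on quotes: #{by_split.size} fields"
puts "CSV.parse_line:  #{by_csv.size} fields"
```

If the column positions matter, that extra element shifts everything after it by one.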