How can I further process the line of data that causes the Ruby FasterCSV library to throw a MalformedCSVError?

Question

The incoming data file(s) contain malformed CSV data such as non-escaped quotes, as well as (valid) CSV data such as fields containing new lines. If a CSV format error is detected I would like to use an alternative routine on that data.

With the following sample code (abbreviated for simplicity):

FasterCSV.open( file ){|csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the good rows here...

    rescue FasterCSV::MalformedCSVError => e
      # Do things with the bad rows here...
      next
    end
  end
}

The MalformedCSVError is caused in the csv.shift method. How can I access the data that caused the error from the rescue clause?

Answer 1:

require 'csv' # As of Ruby 1.9, the standard CSV library is FasterCSV

# File.open('test.txt','r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row #handle row
    end
  rescue CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end

# Output:

# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]   

__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid

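Just feed the file line by line to FasterCSV and rescue the error. Note that, as the output above shows, this also rejects a valid record whose quoted field spans more than one line (the first two lines of the sample data).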

Answer 2:

This is going to be really difficult. Some things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is to make your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then when FasterCSV raises an exception you can ask your special File object for the last line. Something like this:

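class MyFile < File
  attr_writer :last_gets

  # Everything read since the buffer was last cleared.
  # Initialized lazily: an assignment in the class body would set a
  # class-level variable, leaving the instance's @last_gets nil.
  def last_gets
    @last_gets ||= String.new
  end

  def gets(*args)
    line = super
    last_gets << line if line # gets returns nil at end of file
    line
  end
end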

# then...

file  = MyFile.open(filename, 'r')
csv   = FasterCSV.new file

row = true
while row
  begin
    break unless row = csv.shift

    # do things with the good row here...

  rescue FasterCSV::MalformedCSVError => e
    bad_row = file.last_gets

    # do something with bad_row here...

    next
  ensure
    file.last_gets = '' # nuke the @last_gets "buffer"
  end
end

Kinda neat, right? BUT! there are caveats, of course:

I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.
This might or might not fail if your CSV file contains newline characters inside quoted fields. The reason for this is described in the source: basically, if a quoted value contains a newline, then shift has to do additional gets calls to get the entire line. There could be a clever way around this limitation, but it's not coming to me right now. If you're sure your file doesn't have any newline characters within quoted fields, then this shouldn't be a worry for you.

Your other option would be to read the file using File#gets and pass each line in turn to FasterCSV.parse_line, but I'm pretty sure in so doing you'd squander any performance advantage gained from using FasterCSV.
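A minimal sketch of that second option, assuming the standard CSV library (FasterCSV's successor) and in-memory lines instead of a real file. The helper names each_logical_line and split_rows are made up for illustration. To cope with quoted fields that contain newlines, it glues physical lines together until the count of double quotes is even, which is only a heuristic and not how the library itself reads records:

```ruby
require 'csv'

# Hypothetical helper: glue physical lines together until the number of
# double quotes seen so far is even, i.e. no quoted field is left open.
def each_logical_line(lines)
  buffer = String.new
  lines.each do |line|
    buffer << line
    next if buffer.count('"').odd? # still inside a quoted field
    yield buffer
    buffer = String.new
  end
  yield buffer unless buffer.empty? # leftover with an unclosed quote
end

# Hypothetical helper: returns [good_rows, bad_lines].
def split_rows(lines)
  good = []
  bad = []
  each_logical_line(lines) do |raw|
    begin
      row = CSV.parse_line(raw)
      good << row if row
    rescue CSV::MalformedCSVError
      bad << raw # raw text is available for an alternative routine
    end
  end
  [good, bad]
end

sample = [
  "1,\"aaa\n", "aaa\",valid\n", # valid row with an embedded newline
  "2,b\"b\"b,invalid\n",        # stray quotes inside an unquoted field
  "3,ccc,valid\n"
]
good, bad = split_rows(sample)
p good
p bad
```

The raw text of each rejected record ends up in bad, so it can be handed to a different routine. The quote-counting trick has a real weakness: a line with an odd number of stray quotes keeps the buffer "open" and swallows every line after it, so it is only worth trying when such lines are rare or can be detected some other way.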

Answer 3:

I used Jordan's file subclassing approach to fix the problem with my input data before CSV ever tries to parse it. In my case, I had a file that used \" to escape quotes, instead of the "" that CSV expects. Hence,

class MyFile < File
  def gets(*args)
    line = super
    if line != nil
      line.gsub!('\\"','""')  # fix the \" that would otherwise cause a parse error
    end
    line
  end
end

require 'csv'

infile = MyFile.open(filename)
incsv = CSV.new(infile)

while row = incsv.shift # shift is a CSV method, so call it on incsv, not infile
  # process each row here
end

This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.
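The substitution itself can be tried on a single string without any file handling. This is a small sketch assuming the standard CSV library; normalize_quotes is a hypothetical name, not part of CSV:

```ruby
require 'csv'

# Hypothetical helper: rewrite backslash-escaped quotes (\") as doubled
# quotes ("") so that the standard CSV parser accepts the line.
def normalize_quotes(line)
  line.gsub('\\"', '""')
end

raw = '1,"say \\"hi\\"",x' # the text 1,"say \"hi\"",x
row = CSV.parse_line(normalize_quotes(raw))
p row
```

Be aware that this blind replacement also rewrites a field that legitimately ends in a backslash just before its closing quote, so it is only safe when the input never contains that pattern.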

Source: https://stackoverflow.com/questions/7671127/how-can-i-further-process-the-line-of-data-that-causes-the-ruby-fastercsv-librar

Tags

ruby

fastercsv