Best way to read CSV in Ruby. FasterCSV?

问题

I have a CSV file that I want to read with Ruby and create Ruby objects to insert into a MySQL database with Active Record. What's the best way to do this? I see two clear options: FasterCSV & the Ruby core CSV. Which is better? Is there a better option that I'm missing?

EDIT: Gareth says to use FasterCSV, so what's the best way to read a CSV file using FasterCSV? Looking at the documentation, I see methods called parse, foreach, read, open... It says that foreach "is intended as the primary interface for reading CSV files." So, I guess I should use that one?

回答1:

Ruby 1.9 adopted FasterCSV as its core CSV processor, so I would say it's definitely better to go for FasterCSV, even if you're still using Ruby 1.8

回答2:

If you have a lot of records to import you might want to use MySQL's loader. It's going to be extremely fast.

LOAD DATA INFILE can be used to read files obtained from external sources. For example, many programs can export data in comma-separated values (CSV) format, such that lines have fields separated by commas and enclosed within double quotation marks, with an initial line of column names. If the lines in such a file are terminated by carriage return/newline pairs, the statement shown here illustrates the field- and line-handling options you would use to load the file:

LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\r\n'
  IGNORE 1 LINES;

If the input values are not necessarily enclosed within quotation marks, use OPTIONALLY before the ENCLOSED BY keywords.

Use that to pull everything into a temporary table, then use ActiveRecord to run queries against it to delete records that you don't want, then copy from the temp table to your production one, then drop or truncate the temp. Or, use ActiveRecord to search the temporary table and copy the records to production, then drop or truncate the temp. You might even be able to do a table-to-table copy inside MySQL or append one table to another.

It's going to be tough to beat the speed of the dedicated loader, and using the database's query mechanism to process records in bulk. The step of turning a record in the CSV file into an object, then using the ORM to write it to the database adds a lot of extra overhead, so unless you have some super difficult validations requiring Ruby's logic, you'll be faster going straight to the database.

EDIT: Here's a simple CSV header to DB column mapper example:

require "csv"

data = <<EOT
header1, header2, header 3
1, 2, 3
2, 2, 3
3, 2, 3
EOT

header_to_table_columns = {
  'header1'  => 'col1',
  'header2'  => 'col2',
  'header 3' => 'col3'
}

arr_of_arrs = CSV.parse(data)
headers = arr_of_arrs.shift.map{ |i| i.strip }
db_cols = header_to_table_columns.values_at(*headers)
arr_of_arrs.each do |ary|
  # insert into the database using an ORM or by creating insert statements
end

来源：https://stackoverflow.com/questions/4369039/best-way-to-read-csv-in-ruby-fastercsv

标签

mysql

ruby

csv

fastercsv