I have a problem where I need to download, unzip, and then process a very large CSV file line by line. I think it's useful to give you an idea of how large the file is:
It's been a while since I posted this question, and in case anyone else comes across it, I thought it might be worth sharing what I found.
Ruby's `CSV` library was too slow. My CSV file was simple enough that I didn't need all its machinery for quoted strings or type coercion anyway. It was much faster to just use `IO#gets` and split each line on commas.

I also wasn't able to stream the archive by pointing `Zip::InputStream` at some `IO` containing the zip data. This is because the zip format puts the End of Central Directory (EOCD) record at the end of the file, and a reader needs it to locate and extract the entries, so streaming the zip straight from HTTP doesn't seem like it would work.

The solution I ended up going with was to download the file to disk and then use `IO.popen` (core Ruby, no extra library needed) with the Linux `unzip` utility to stream the uncompressed CSV out of the zip.
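As a quick illustration of the EOCD point, a sketch with hand-built (hypothetical, not valid-archive) bytes: the EOCD record carries the signature `"PK\x05\x06"` and sits at the very end of the file, after the entry data a streaming reader would see first.

```ruby
# Hypothetical bytes, not a valid archive: a local entry header ("PK\x03\x04")
# followed by entry data, then the 22-byte EOCD record ("PK\x05\x06" signature
# plus 18 bytes of fields, all zero here as for an empty central directory).
fake_zip = "PK\x03\x04" + "entry data..." + "PK\x05\x06" + "\x00" * 18

# The EOCD is found by scanning backwards from the end of the file,
# which is why a reader needs the whole file, not a forward-only stream.
eocd_at = fake_zip.rindex("PK\x05\x06")
p fake_zip.bytesize - eocd_at  # => 22, the EOCD is the final 22 bytes
```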
IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
  while (line = io.gets)
    # do stuff to process the CSV line
  end
end
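The line processing elided by the comment above can be sketched like this, with `StringIO` standing in for the `unzip` pipe (the data and field names are made up):

```ruby
require 'stringio'

# StringIO stands in for the IO.popen pipe; the CSV data is made up.
io = StringIO.new("id,name\n1,alice\n2,bob\n")

header = io.gets.chomp.split(',')
rows = []
while (line = io.gets)
  rows << line.chomp.split(',')
end

p header  # => ["id", "name"]
p rows    # => [["1", "alice"], ["2", "bob"]]
```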
The `-p` switch on `unzip` sends the extracted file to stdout, and `IO.popen` uses a pipe to expose that as an `IO` object in Ruby. It works pretty nicely. You could layer `CSV` on top of it too if you want that extra processing; it was just too slow for me.
require 'csv'

IO.popen('unzip -p /path/to/big_file.zip big_file.csv', 'rb') do |io|
  CSV.new(io).each do |row|
    # process the row
  end
end
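For reference, each `row` comes out as an array of string fields, and this route does handle the quoted strings that a plain split on commas would break. A minimal sketch, again with `StringIO` standing in for the pipe and made-up data:

```ruby
require 'csv'
require 'stringio'

# StringIO stands in for the IO.popen pipe; the data is made up.
# The quoted field contains a comma, which a naive split would mangle.
io = StringIO.new("1,alice\n2,\"smith, bob\"\n")

rows = CSV.new(io).read
p rows  # => [["1", "alice"], ["2", "smith, bob"]]
```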