Ruby: start reading at arbitrary point in large file


Jumping to an exact line is a bit difficult, but you can seek to a given byte within a file.

IO#seek and IO#pos will both let you move to a given byte offset within the file.
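
For example, a minimal sketch (the filename and offset here are made up for illustration) that jumps into the middle of a file and resynchronizes on the next line boundary:

File.open('filelist.txt', 'r') do |f|
  f.seek(1024, IO::SEEK_SET)  # jump to byte 1024, counted from the start
  f.gets                      # throw away the partial line we landed in
  line = f.gets               # first complete line after the offset
  puts "now at byte #{f.pos}: #{line}"
end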

To see what sort of difference slurping the entire file at once vs. reading it line-by-line makes, I tested against a file that is about 99 MB, with over 1,000,000 lines.

greg-mbp-wireless:Desktop greg$ wc filelist.txt 
 1003002 1657573 99392863 filelist.txt

I put the following loop into a Ruby file and ran it from the command line with the time command:

# Slurp the whole file into one string, then walk its lines.
IO.read(ARGV.first).each_line { |l|
}

greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt 

real    0m1.411s
user    0m0.653s
sys     0m0.169s

Then I changed it to read line-by-line and timed that too:

# Read the file one line at a time instead of slurping it.
IO.foreach(ARGV.first) { |l|
}

greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt 

real    0m1.053s
user    0m0.741s
sys     0m0.278s

I'm not sure why, but reading line by line was faster. That might be tied to memory allocation, since Ruby has to load the entire file into RAM in the first example, or it might simply be an anomaly, since I only ran each test once. Using read with an explicit file size might be faster still, because Ruby would know in advance how much memory it needs to allocate.

And that was all I needed to test this:

fcontent = ''
File.open(ARGV.first, 'r') do |fi|
  fsize = fi.size            # total file size in bytes
  fcontent = fi.read(fsize)  # read exactly that many bytes in one call
end
puts fcontent.size

greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt 
99392863

real    0m0.168s
user    0m0.010s
sys     0m0.156s

Looks like knowing how much needs to be read makes quite a difference.

Adding back in the loop over the string buffer results in this:

File.open(ARGV.first, 'r') do |fi|
  fsize = fi.size
  fi.read(fsize).each_line { |l|
  }
end

greg-mbp-wireless:Desktop greg$ time ruby test.rb filelist.txt 

real    0m0.732s
user    0m0.572s
sys     0m0.158s

That's still an improvement.

If you used a Queue and fed it from a thread responsible for reading the file, then consumed the queue from whatever code processes the incoming text, you might see higher overall throughput.
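
A minimal sketch of that idea, where the placeholder comment stands in for the real per-line work:

require 'thread'  # Queue needs this on older Rubies; newer ones have it built in

queue  = Queue.new
reader = Thread.new do
  IO.foreach(ARGV.first) { |line| queue << line }
  queue << :done  # sentinel so the consumer knows the file is finished
end

while (line = queue.pop) != :done
  # process the line here
end

reader.join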

If you want to start at a specific line in the file, I would recommend just shelling out to tail.

excerpt = `tail -n +5000 filename.log`

This gives you the contents of filename.log from line 5000 to the end of the file.
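
If you'd rather stay in Ruby, a one-line sketch of the same thing (the filename and line number are placeholders, and it still reads the file from the start, just as tail -n + does):

excerpt = IO.foreach('filename.log').drop(4999).join  # skip lines 1..4999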

Try faster_csv if you haven't already, and if that's still too slow, use something with native extensions in C, like this: http://github.com/wwood/excelsior
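
For reference, a minimal FasterCSV sketch (data.csv is a placeholder; on Ruby 1.9+ the same API ships as the built-in CSV class):

require 'fastercsv'

FasterCSV.foreach('data.csv') do |row|
  # row is an array of fields for one line of the file
end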
