I\'m processing huge data files (millions of lines each).
Before I start processing I\'d like to get a count of the number of lines in the file, so I can then indic
The test results for more than 135k lines are shown below. This is my benchmark code.
file_name = '100m.csv'
Benchmark.bm do |x|
x.report { File.new(file_name).readlines.size }
x.report { `wc -l "#{file_name}"`.strip.split(' ')[0].to_i }
x.report { File.read(file_name).scan(/\n/).count }
end
result is
user system total real
0.100000 0.040000 0.140000 ( 0.143636)
0.000000 0.000000 0.090000 ( 0.093293)
0.380000 0.060000 0.440000 ( 0.464925)
The wc -l
code has one problem.
If there is only one line in the file and the last character does not end with \n
, then count is zero.
So, I recommend calling wc when you count more then one line.
If you are in a Unix environment, you can just let wc -l
do the work.
It will not load the whole file into memory; since it is optimized for streaming file and count word/line the performance is good enough rather then streaming the file yourself in Ruby.
SSCCE:
filename = 'a_file/somewhere.txt'
line_count = `wc -l "#{filename}"`.strip.split(' ')[0].to_i
p line_count
Or if you want a collection of files passed on the command line:
wc_output = `wc -l "#{ARGV.join('" "')}"`
line_count = wc_output.match(/^ *([0-9]+) +total$/).captures[0].to_i
p line_count
Same as DJ's answer, but giving the actual Ruby code:
count = %x{wc -l file_path}.split[0].to_i
The first part
wc -l file_path
Gives you
num_lines file_path
The split
and to_i
put that into a number.
I have this one liner.
puts File.foreach('myfile.txt').count
You can read the last line only and see its number:
f = File.new('huge-file')
f.readlines[-1]
count = f.lineno
Summary of the posted solutions
require 'benchmark'
require 'csv'
filename = "name.csv"
Benchmark.bm do |x|
x.report { `wc -l < #{filename}`.to_i }
x.report { File.open(filename).inject(0) { |c, line| c + 1 } }
x.report { File.foreach(filename).inject(0) {|c, line| c+1} }
x.report { File.read(filename).scan(/\n/).count }
x.report { CSV.open(filename, "r").readlines.count }
end
File with 807802 lines:
user system total real
0.000000 0.000000 0.010000 ( 0.030606)
0.370000 0.050000 0.420000 ( 0.412472)
0.360000 0.010000 0.370000 ( 0.374642)
0.290000 0.020000 0.310000 ( 0.315488)
3.190000 0.060000 3.250000 ( 3.245171)