I'm processing huge data files (millions of lines each).
Before I start processing I'd like to get a count of the number of lines in the file, so I can then indicate how far along the processing is.
With UNIX-style text files, it's very simple:
f = File.new("/path/to/whatever")
num_newlines = 0
# Read one character at a time so the whole file is never held in memory
while (c = f.getc) != nil
  num_newlines += 1 if c == "\n"
end
f.close
That's it. For MS Windows text files, you'll have to check for a sequence of "\r\n" instead of just "\n", but that's not much more difficult. For Mac OS Classic text files (as opposed to Mac OS X), you would check for "\r" instead of "\n".
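For example, here is a rough sketch (not from the original answer) of the same character-at-a-time loop adapted to count Windows-style "\r\n" endings by remembering the previous character; the path is just a placeholder:
f = File.new("/path/to/whatever")
num_newlines = 0
prev = nil
while (c = f.getc) != nil
  # A line ends where a "\r" is immediately followed by a "\n"
  num_newlines += 1 if prev == "\r" && c == "\n"
  prev = c
end
f.close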
So, yeah, this looks like C. So what? C's awesome and Ruby is awesome because when a C answer is easiest that's what you can expect your Ruby code to look like. Hopefully your dain hasn't already been bramaged by Java.
By the way, please don't even consider any of the answers above that use IO#read or IO#readlines and then call a String method on what's been read. You said you didn't want to read the whole file into memory, and that's exactly what these do. This is why Donald Knuth recommends that people understand how to program closer to the hardware: if they don't, they'll end up writing "weird code". Obviously you don't want to code close to the hardware whenever you don't have to, but that should be common sense. However, you should learn to recognize the instances where you do have to get closer to the nuts and bolts, such as this one.
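If you do want to stay a little closer to the nuts and bolts without reading the whole file into memory, one common pattern (a sketch, not part of the answer above) is to read fixed-size chunks and count the newlines in each:
num_newlines = 0
File.open("/path/to/whatever") do |f|   # placeholder path
  # Read up to 64 KB at a time; read(length) returns nil at end of file
  while (chunk = f.read(64 * 1024))
    num_newlines += chunk.count("\n")
  end
end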
And don't try to get more "object oriented" than the situation calls for. That's an embarrassing trap for newbies who want to look more sophisticated than they really are. You should always be glad when the answer really is simple, and not be disappointed when there's no complexity to give you the opportunity to write "impressive" code. However, if you want to look somewhat "object oriented" and don't mind reading an entire line into memory at a time (i.e., you know the lines are short enough), you can do this:
f = File.new("/path/to/whatever")
num_newlines = 0
# each_line reads one line at a time, so only the current line is in memory
f.each_line do
  num_newlines += 1
end
This is a good compromise, but only if the lines aren't too long; in that case it may even run more quickly than my first solution.
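If you want to see which of the two wins on your own data, a quick comparison with the standard Benchmark module might look like this (huge_data_file.csv is just a stand-in for your file):
require 'benchmark'

file = "huge_data_file.csv"   # stand-in for your actual file

Benchmark.bm(12) do |x|
  x.report("getc loop") do
    n = 0
    File.open(file) do |f|
      while (c = f.getc)
        n += 1 if c == "\n"
      end
    end
  end

  x.report("each_line") do
    n = 0
    File.open(file) { |f| f.each_line { n += 1 } }
  end
end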
wc -l in Ruby with less memory, the lazy way:
# Build lazy [file_name, IO] pairs (STDIN if no arguments were given),
# then count each file's lines lazily and print wc-style output.
(ARGV.length == 0 ?
    [["", STDIN]] :
    ARGV.lazy.map { |file_name|
      [file_name, File.open(file_name)]
    })
  .map { |file_name, file|
    "%8d %s\n" % [*file
      .each_line
      .lazy
      .map { |line| 1 }
      .reduce(:+), file_name]
  }
  .each(&:display)
as originally shown by Shugo Maeda.
Example:
$ curl -s -o wc.rb -L https://git.io/vVrQi
$ chmod u+x wc.rb
$ ./wc.rb huge_data_file.csv
43217291 huge_data_file.csv
Reading the file a line at a time:
count = File.foreach(filename).inject(0) {|c, line| c+1}
or the Perl-ish (relying on the special variable $., which holds the line number of the last line read):
File.foreach(filename) {}
count = $.
or (though this one reads the whole file into memory):
count = 0
File.open(filename) {|f| count = f.read.count("\n")}
All of these will be slower than shelling out to wc:
count = %x{wc -l #{filename}}.split.first.to_i
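One caveat (not from the original answer): interpolating the filename straight into the shell command breaks on names containing spaces or shell metacharacters, so you may want to escape it first with Shellwords from the standard library:
require 'shellwords'

# Escape the filename so spaces and shell metacharacters don't break the command
count = %x{wc -l #{Shellwords.escape(filename)}}.split.first.to_i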