Count the number of lines in a file without reading entire file into memory?

忘掉有多难 2020-12-24 01:38

I'm processing huge data files (millions of lines each).

Before I start processing I'd like to get a count of the number of lines in the file, so I can then indic

15 answers
  • 2020-12-24 02:35

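    With UNIX style text files, it's very simple:

    num_newlines = 0
    # The block form of File.open closes the file when the block exits.
    File.open("/path/to/whatever") do |f|
      # Read one character at a time; getc returns nil at end of file.
      while (c = f.getc)
        num_newlines += 1 if c == "\n"
      end
    end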
    

    That's it. For MS Windows text files, each line ends in "\r\n", so counting "\n" still gives the right number; there's no need to look for the two-character sequence. For Mac OS Classic text files (as opposed to Mac OS X), you would check for "\r" instead of "\n".
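A minimal sketch of the same loop with the terminator made configurable, assuming a single-byte terminator. The helper name `count_terminators` is hypothetical, not something from the answer. It reads bytes rather than characters, so the file's encoding doesn't matter for the count.

```ruby
# Hypothetical helper: count occurrences of a single-byte line terminator.
# "\n" covers UNIX files and also Windows files, since every "\r\n" ends in "\n".
# Pass "\r" for Mac OS Classic files.
def count_terminators(path, terminator = "\n")
  byte = terminator.ord # integer value of the terminator's first character
  count = 0
  # Binary mode so no newline translation happens on any platform.
  File.open(path, "rb") do |f|
    f.each_byte { |b| count += 1 if b == byte }
  end
  count
end
```

Usage would look like `count_terminators("/path/to/whatever")` or `count_terminators("/path/to/whatever", "\r")`.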

    So, yeah, this looks like C. So what? C's awesome and Ruby is awesome because when a C answer is easiest that's what you can expect your Ruby code to look like. Hopefully your dain hasn't already been bramaged by Java.

    By the way, please don't even consider any of the other answers that call IO#read or IO#readlines and then call a method on what's been read. You said you didn't want to read the whole file into memory, and that's exactly what these do. This is why Donald Knuth recommends that people understand how to program closer to the hardware: if they don't, they'll end up writing "weird code". Obviously you don't want to code close to the hardware when you don't have to; that should be common sense. But you should learn to recognize the cases where you do have to get closer to the nuts and bolts, such as this one.
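If reading one character at a time turns out to be too slow, a middle ground is to read fixed-size chunks, so memory use stays bounded no matter how big the file is. This is a sketch under assumptions, not the approach given in the answer: the name `count_newlines_chunked` and the 64 KiB default chunk size are mine.

```ruby
# Hypothetical default; any moderate size should behave similarly.
CHUNK_SIZE = 64 * 1024

# Count "\n" bytes while holding at most chunk_size bytes in memory.
def count_newlines_chunked(path, chunk_size: CHUNK_SIZE)
  count = 0
  File.open(path, "rb") do |f|
    # Reuse one buffer so each read does not allocate a new string.
    buffer = String.new(capacity: chunk_size)
    # IO#read(length, outbuf) returns nil once end of file is reached.
    while f.read(chunk_size, buffer)
      count += buffer.count("\n")
    end
  end
  count
end
```

Because "\n" is a single byte, it doesn't matter where the chunk boundaries fall.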

    And don't try to get more "object oriented" than the situation calls for. That's an embarrassing trap for newbies who want to look more sophisticated than they really are. You should always be glad for the times when the answer really is simple, and not be disappointed when there's no complexity to give you the opportunity to write "impressive" code. However, if you want to look somewhat "object oriented" and don't mind reading an entire line into memory at a time (i.e., you know the lines are short enough), you can do this:

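    num_newlines = 0
    # The block form of File.open closes the file when the block exits.
    File.open("/path/to/whatever") do |f|
      # each_line holds one full line in memory per iteration.
      f.each_line { num_newlines += 1 }
    end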
    

    This is a good compromise as long as the lines aren't too long; in that case it might even run more quickly than my first solution.

  • 2020-12-24 02:35

    wc -l in Ruby with less memory, the lazy way:

    (ARGV.length == 0 ?
       [["", STDIN]] :
       ARGV.lazy.map { |file_name|
         [file_name, File.open(file_name)]
       })
      .map { |file_name, file|
        "%8d %s\n" % [*file
                        .each_line
                        .lazy
                        .map { |line| 1 }
                        .reduce(:+), file_name]
      }
      .each(&:display)
    

    as originally shown by Shugo Maeda.

    Example:

    $ curl -s -o wc.rb -L https://git.io/vVrQi
    $ chmod u+x wc.rb
    $ ./wc.rb huge_data_file.csv
      43217291 huge_data_file.csv
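For use inside a larger program rather than as a script, the same idea can be sketched as a method that takes a list of paths. The name `wc_l` is hypothetical and this is a simplified variant, not Shugo Maeda's code: it closes each file after counting, and it uses `count` so an empty file yields 0.

```ruby
# Hypothetical helper: lazily yield one "wc -l"-style line per path.
# Nothing is opened or read until the result is iterated.
def wc_l(paths)
  paths.lazy.map do |path|
    # each_line yields one line at a time, so only one line is held in memory.
    lines = File.open(path) { |f| f.each_line.count }
    format("%8d %s\n", lines, path)
  end
end
```

Something like `wc_l(ARGV).each(&:display)` would then print the counts. Note that this counts lines as Ruby sees them, so a last line without a trailing newline is included, unlike `wc -l`.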
    
  • 2020-12-24 02:36

    Reading the file a line at a time:

    count = File.foreach(filename).inject(0) {|c, line| c+1}
    

    or the Perl-ish

    File.foreach(filename) {}
    count = $.
    

    or, at the cost of reading the whole file into memory:

    count = 0
    File.open(filename) {|f| count = f.read.count("\n")}
    

    All of these will be slower than shelling out to wc:

    require "shellwords"
    count = %x{wc -l #{Shellwords.escape(filename)}}.split.first.to_i
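Since `wc` isn't available everywhere (stock Windows, for instance), one option is to try it first and fall back to pure Ruby. This is a rough sketch under that assumption; the method names `wc_line_count` and `line_count` are hypothetical. It passes the arguments as a list, so no shell is involved and the filename needs no quoting.

```ruby
require "open3"

# Hypothetical helper: ask wc for the count; nil if wc is missing or fails.
def wc_line_count(path)
  out, status = Open3.capture2("wc", "-l", path)
  # wc prints "<count> <path>", possibly with leading spaces.
  status.success? ? out.split.first.to_i : nil
rescue Errno::ENOENT
  # Raised when the wc executable itself cannot be found.
  nil
end

# Fall back to reading the file a line at a time in Ruby.
def line_count(path)
  wc_line_count(path) || File.foreach(path).count
end
```

The two paths can disagree by one when the last line has no trailing newline: `wc -l` counts newline characters, while `File.foreach` counts the unterminated last line too.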
    