Count the number of lines in a file without reading entire file into memory?

忘掉有多难 2020-12-24 01:38

I'm processing huge data files (millions of lines each).

Before I start processing I'd like to get a count of the number of lines in the file, so I can then indic

15 answers
  • 2020-12-24 02:09

    The test results for more than 135k lines are shown below. This is my benchmark code.

     require 'benchmark'

     file_name = '100m.csv'
     Benchmark.bm do |x|
       x.report { File.new(file_name).readlines.size }
       x.report { `wc -l "#{file_name}"`.strip.split(' ')[0].to_i }
       x.report { File.read(file_name).scan(/\n/).count }
     end
    

    The result is (rows in the same order as the reports above):

       user     system      total        real
     0.100000   0.040000   0.140000 (  0.143636)
     0.000000   0.000000   0.090000 (  0.093293)
     0.380000   0.060000   0.440000 (  0.464925)
    

    The wc -l approach has one caveat: wc counts newline characters, so a final line that does not end with \n is not counted. For a file containing a single line with no trailing newline, the count is zero.

    So I recommend calling wc only when the file has more than one line.
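    The trailing-newline caveat can also be handled in pure Ruby. The following is a minimal sketch, not a benchmarked solution: it reads the file in fixed-size chunks so memory use stays bounded, counts newline characters, and adds one if the file ends without a newline. The method name count_lines and the constant CHUNK_SIZE are hypothetical names chosen for illustration.

```ruby
# Count lines by streaming fixed-size chunks, so memory use stays bounded.
# Unlike `wc -l`, a final line without a trailing "\n" is counted too.
CHUNK_SIZE = 64 * 1024 # hypothetical buffer size; tune as needed

def count_lines(path)
  count = 0
  last_byte = nil
  File.open(path, 'rb') do |f|
    # IO#read with a length returns nil at end of file.
    while (chunk = f.read(CHUNK_SIZE))
      count += chunk.count("\n")
      last_byte = chunk[-1]
    end
  end
  # Non-empty file whose last byte is not a newline: count the trailing line.
  count += 1 if last_byte && last_byte != "\n"
  count
end
```

    Calling count_lines('100m.csv') would then return the same number as wc -l for files that end with a newline, and one more for files that do not.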

  • 2020-12-24 02:10

    If you are in a Unix environment, you can just let wc -l do the work.

    It does not load the whole file into memory. Because wc is optimized for streaming a file and counting words and lines, it performs well compared with streaming the file yourself in Ruby.

    SSCCE:

    filename = 'a_file/somewhere.txt'
    line_count = `wc -l "#{filename}"`.strip.split(' ')[0].to_i
    p line_count
    

    Or, if you want the total for a collection of files passed on the command line (wc prints a total line only when given more than one file):

    wc_output = `wc -l "#{ARGV.join('" "')}"`
    line_count = wc_output.match(/^ *([0-9]+) +total$/).captures[0].to_i
    p line_count
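
    The backtick versions above pass the file name through a shell, so names containing quotes or other shell metacharacters can break the command. As a hedged alternative, here is a minimal sketch that calls wc through IO.popen with an argument array, which runs the program directly without a shell. The method name wc_line_count is hypothetical, and the sketch assumes wc is available on the PATH.

```ruby
# Run `wc -l` without a shell: each array element is passed to wc as a
# separate argument, so spaces or quotes in the file name need no escaping.
def wc_line_count(path)
  output = IO.popen(['wc', '-l', path], &:read)
  # $? holds the exit status of the child once the block form returns.
  raise "wc failed for #{path}" unless $?.success?
  # wc prints "<count> <path>", possibly with leading spaces.
  output.split.first.to_i
end
```

    Like any wc -l call, this counts newline characters, so a final line without a trailing newline is not included.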
    
  • 2020-12-24 02:12

    Same as DJ's answer, but giving the actual Ruby code:

    count = %x{wc -l "#{file_path}"}.split[0].to_i
    

    The first part

    wc -l file_path
    

    Gives you

    num_lines file_path
    

    The split and to_i put that into a number.

  • 2020-12-24 02:12

    I use this one-liner, which iterates over the file line by line instead of loading it all at once:

    puts File.foreach('myfile.txt').count
    
  • 2020-12-24 02:14

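    You can read through to the last line and then check its line number. Note that readlines still loads every line into memory, so this is only suitable when the file fits in RAM:

    f = File.new('huge-file')
    f.readlines[-1]     # reads all lines; lineno now points past the last one
    count = f.lineno    # number of lines read so far
    f.close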
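
    The same lineno idea can be used without holding the lines in memory. This is a minimal sketch under that assumption: each_line advances through the file one line at a time and discards each line, and lineno then reports how many lines were read. The method name count_lines_via_lineno is hypothetical.

```ruby
# Stream through the file one line at a time, keeping nothing,
# then ask the IO object how many lines it has read.
def count_lines_via_lineno(path)
  File.open(path) do |f|
    f.each_line { } # discard each line; only the counter matters
    f.lineno
  end
end
```

    The block form of File.open also closes the file handle automatically when the count has been taken.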
    
  • 2020-12-24 02:16

    Summary of the posted solutions

    require 'benchmark'
    require 'csv'
    
    filename = "name.csv"
    
    Benchmark.bm do |x|
      x.report { `wc -l < #{filename}`.to_i }
      x.report { File.open(filename).inject(0) { |c, line| c + 1 } }
      x.report { File.foreach(filename).inject(0) {|c, line| c+1} }
      x.report { File.read(filename).scan(/\n/).count }
      x.report { CSV.open(filename, "r").readlines.count }
    end
    

    File with 807802 lines (rows in the same order as the reports above):

           user     system      total        real
       0.000000   0.000000   0.010000 (  0.030606)
       0.370000   0.050000   0.420000 (  0.412472)
       0.360000   0.010000   0.370000 (  0.374642)
       0.290000   0.020000   0.310000 (  0.315488)
       3.190000   0.060000   3.250000 (  3.245171)
    