I'm processing huge data files (millions of lines each). Before I start processing I'd like to get a count of the number of lines in the file, so I can then indicate progress.
Using Ruby:

file = File.open("path-to-file", "r")
file.readlines.size

This was 39 milliseconds faster than wc -l on a file with 325,477 lines.
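Note that readlines builds an array of every line in memory, which can hurt on files with millions of lines. A minimal streaming sketch, assuming you only need the count (the path is illustrative):

# Stream the file line by line; Enumerator#count iterates
# without retaining the lines in memory.
line_count = File.foreach("path-to-file").count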
DISCLAIMER: the already existing benchmark used count rather than length or size, and was tedious to read IMHO. Hence this new answer.
require "benchmark"
require "benchmark/ips"
require "csv"
filename = ENV.fetch("FILENAME")
Benchmark.ips do |x|
x.report("wc") { `wc -l #{filename}`.to_i }
x.report("open") { File.open(filename).inject(0, :next) }
x.report("foreach") { File.foreach(filename).inject(0, :next) }
x.report("foreach $.") { File.foreach(filename) {}; $. }
x.report("read.scan.length") { File.read(filename).scan(/\n/).length }
x.report("CSV.open.readlines") { CSV.open(filename, "r").readlines.length }
x.report("IO.readlines.length") { IO.readlines(filename).length }
x.compare!
end
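To run it, pass the file via the environment variable the script reads (the script name is illustrative):

FILENAME=huge_file.csv ruby count_lines_benchmark.rb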
On my MacBook Pro (2017) with a 2.3 GHz Intel Core i5 processor:
Warming up --------------------------------------
                  wc     8.000  i/100ms
                open     2.000  i/100ms
             foreach     2.000  i/100ms
          foreach $.     2.000  i/100ms
    read.scan.length     2.000  i/100ms
  CSV.open.readlines     1.000  i/100ms
 IO.readlines.length     2.000  i/100ms
Calculating -------------------------------------
                  wc    115.014 (±21.7%) i/s -    552.000 in   5.020531s
                open     22.450 (±26.7%) i/s -    104.000 in   5.049692s
             foreach     32.669 (±27.5%) i/s -    150.000 in   5.046793s
          foreach $.     25.244 (±31.7%) i/s -    112.000 in   5.020499s
    read.scan.length     44.102 (±31.7%) i/s -    190.000 in   5.033218s
  CSV.open.readlines      2.395 (±41.8%) i/s -     12.000 in   5.262561s
 IO.readlines.length     36.567 (±27.3%) i/s -    162.000 in   5.089395s

Comparison:
                  wc:      115.0 i/s
    read.scan.length:       44.1 i/s - 2.61x slower
 IO.readlines.length:       36.6 i/s - 3.15x slower
             foreach:       32.7 i/s - 3.52x slower
          foreach $.:       25.2 i/s - 4.56x slower
                open:       22.4 i/s - 5.12x slower
  CSV.open.readlines:        2.4 i/s - 48.02x slower
This was run on a file containing 75,516 lines and 3,532,510 characters (roughly 47 characters per line). You should try it with your own file, dimensions, and computer for a precise result.
Using foreach without inject is about 3% faster than with inject. Both are very much faster (more than 100x in my experience) than using getc.
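For reference, a minimal sketch of the getc approach being compared against, counting newline characters one at a time (the path is illustrative):

count = 0
File.open("path-to-file") do |f|
  # getc returns one character at a time, and nil at EOF.
  while (char = f.getc)
    count += 1 if char == "\n"
  end
end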
Using foreach without inject can also be slightly simplified (relative to the snippet given elsewhere in this thread) as follows:

count = 0; File.foreach(path) { count += 1 }
puts "count: #{count}"
If the file is a CSV file, the length of the records should be pretty uniform if the content of the file is numeric. Wouldn't it make sense to just divide the size of the file by the length of a record, or by the mean length of the first 100 records?
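A minimal sketch of that estimate, assuming reasonably uniform record lengths (the helper name and the sample size are illustrative):

# Estimate the line count from the file size and the average
# byte length of the first 100 lines; an approximation only.
def estimated_line_count(path, sample_size = 100)
  sample = File.foreach(path).first(sample_size)
  return 0 if sample.empty?
  avg_bytes = sample.sum(&:bytesize).to_f / sample.size
  (File.size(path) / avg_bytes).round
end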
It doesn't matter what language you're using: you're going to have to read the whole file if the lines are of variable length. That's because the newlines could be anywhere, and there's no way to know without reading the file (assuming it isn't cached, which generally speaking it isn't).
If you want to indicate progress, you have two realistic options. You can extrapolate progress based on assumed line length:
assumed lines in file = size of file / assumed line size
progress = lines processed / assumed lines in file * 100%
since you know the size of the file. Alternatively you can measure progress as:
progress = bytes processed / size of file * 100%
This should be sufficient.
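A minimal sketch of the byte-based option in Ruby (variable names and the output format are illustrative):

path = "path-to-file" # illustrative
total_bytes = File.size(path)
processed_bytes = 0

File.foreach(path) do |line|
  processed_bytes += line.bytesize
  # ... process the line here ...
  printf("\rprogress: %5.1f%%", 100.0 * processed_bytes / total_bytes)
end
puts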
For reasons I don't fully understand, scanning the file for newlines with File.read(file).scan(/\n/) seems to be a lot faster than doing CSV#readlines.count.
The following benchmark used a CSV file with 1,045,574 lines of data and 4 columns:
       user     system      total        real
   0.639000   0.047000   0.686000 (  0.682000)
  17.067000   0.171000  17.238000 ( 17.221173)
The code for the benchmark is below:
require 'benchmark'
require 'csv'

file = "1-25-2013 DATA.csv"

Benchmark.bm do |x|
  x.report { File.read(file).scan(/\n/).count }
  x.report { CSV.open(file, "r").readlines.count }
end
As you can see, scanning the file for newlines is an order of magnitude faster. The likely reason is that CSV#readlines has to parse every row into an array of fields, while scan only has to search a single string for newline characters.