In perlfaq5, there's an answer for "How do I count the number of lines in a file?". The current answer suggests a sysread and a tr/\n//. I wanted to try a few other approaches to see how they compare.
I wondered whether the benchmarks we've been using have too many moving parts: we are crunching data files of different sizes, using different line lengths, and trying to gauge the speed of tr/// relative to its competitors -- with an underlying (but untested) assumption that tr/// is the method whose performance varies with line length.
Also, as brian has pointed out in a few comments, we are feeding tr/// buffers of data that are always the same size (4096 bytes). If any of the methods should be insensitive to line size, it should be tr///.
And then it struck me: what if tr/// were the stable reference point and the other methods were the ones varying with line size? When you look out your spaceship window, is it you or that Klingon bird-of-prey that's moving?
So I developed a benchmark that held the size of the data files constant: line length varies, but the total number of bytes stays the same. As the results show:
- The tr/// approach is the one least sensitive to variation in line length. Since the total number of bytes processed is constant for all three line lengths tested (short, medium, long), this means that tr/// is quite efficient at editing the string it is given. Even though the short-line data file requires many more edits, the tr/// approach is able to crunch that file almost as fast as it handles the long-line file.
- The <> approaches speed up as the lines become longer, although at a diminishing rate. This makes sense: since each call to <> requires some work, it should be slower to process a given number of bytes using shorter lines (at least over the range tested).
- The s/// approach is also sensitive to line length. Like tr///, this approach works by editing the string it is given. Again, shorter line length means more edits. Apparently, the ability of s/// to make such edits is much less efficient than that of tr///.

Here are the results on Solaris with Perl 5.8.8:
# ln = $. <>, then check $.
# nn = $n <>, counting lines
# tr = tr/// using sysread
# ss = s/// using sysread
# S = short lines (50)
# M = medium lines (500)
# L = long lines (5000)
         Rate  (vs. nn-S)
nn-S 1.66/s --
ln-S 1.81/s 9%
ss-S 2.45/s 48%
nn-M 4.02/s 142%
ln-M 4.07/s 145%
ln-L 4.65/s 180%
nn-L 4.65/s 180%
ss-M 5.85/s 252%
ss-L 7.04/s 324%
tr-S 7.30/s 339% # tr
tr-L 7.63/s 360% # tr
tr-M 7.69/s 363% # tr
The results on Windows ActiveState's Perl 5.10.0 were roughly comparable.
Finally, the code:
use strict;
use warnings;
use Set::CrossProduct;
use Benchmark qw(cmpthese);
# Args: file size (in million bytes)
# N of benchmark iterations
# true/false (whether to regenerate files)
#
# My results were run with 50 10 1
main(@ARGV);
sub main {
    my ($file_size, $benchmark_n, $regenerate) = @_;
    $file_size *= 1000000;

    my @file_names = create_files($file_size, $regenerate);

    my %methods = (
        ln => \&method_ln,   # $.
        nn => \&method_nn,   # $n
        tr => \&method_tr,   # tr///
        ss => \&method_ss,   # s///
    );

    my $combo_iter = Set::CrossProduct->new([ [keys %methods], \@file_names ]);

    open my $log_fh, '>', 'log.txt' or die $!;

    my %benchmark_args = map {
        my ($m, $f) = @$_;
        "$m-$f" => sub { $methods{$m}->($f, $log_fh) }
    } $combo_iter->combinations;

    cmpthese($benchmark_n, \%benchmark_args);

    close $log_fh;
}
sub create_files {
    my ($file_size, $regenerate) = @_;
    my %line_lengths = (
        S => 50,
        M => 500,
        L => 5000,
    );
    for my $f (keys %line_lengths){
        next if -f $f and not $regenerate;
        create_file($f, $line_lengths{$f}, $file_size);
    }
    return keys %line_lengths;
}
sub create_file {
    my ($file_name, $line_length, $file_size) = @_;
    my $n_lines = int($file_size / $line_length);
    warn "Generating $file_name with $n_lines lines\n";
    my $line = 'a' x ($line_length - 1);
    chop $line if $^O eq 'MSWin32';   # allow for CRLF line endings
    open(my $fh, '>', $file_name) or die $!;
    print $fh $line, "\n" for 1 .. $n_lines;
    close $fh;
}
sub method_nn {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file or die $!;
    my $n = 0;
    $n ++ while <$data_fh>;
    print $log_fh "$data_file \$n $n\n";
    close $data_fh;
}

sub method_ln {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file or die $!;
    1 while <$data_fh>;
    print $log_fh "$data_file \$. $.\n";
    close $data_fh;
}
sub method_tr {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file or die $!;
    my $n = 0;
    my $buffer;
    while (sysread $data_fh, $buffer, 4096) {
        $n += ($buffer =~ tr/\n//);
    }
    print $log_fh "$data_file tr $n\n";
    close $data_fh;
}

sub method_ss {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file or die $!;
    my $n = 0;
    my $buffer;
    while (sysread $data_fh, $buffer, 4096) {
        $n += ($buffer =~ s/\n//g);
    }
    print $log_fh "$data_file s/ $n\n";
    close $data_fh;
}
Update in response to Brad's comment: I tried all three variants, and they behaved roughly like s/\n//g -- slower for the data files with shorter lines (with the additional qualification that s/(\n)/$1/ was even slower than the others). The interesting part was that m/\n/g was basically the same speed as s/\n//g, suggesting that the slowness of the regex approach (both s/// and m//) does not hinge directly on the matter of editing the string.
Long lines are about 65 times larger than short lines, and your numbers indicate that tr/\n// runs about 65 times slower on them. This is as expected.
wc evidently scales better for long lines. I don't really know why; perhaps because it is tuned to do nothing but count newlines, especially when you use the -l option.
I'm also seeing tr/// get relatively slower as the line lengths increase, although the effect isn't as dramatic. These results are from ActivePerl 5.10.1 (32-bit) on Windows 7 x64. I also got "too few iterations for a reliable count" warnings at 100 iterations, so I bumped the count up to 500.
VL: 4501.06 288
LO: 749.25 29
SH: 69.38 6
VA: 104.66 55
Rate VL-$count VL-$. VL-tr VL-s VL-wc
VL-$count 2.82/s -- -0% -52% -56% -99%
VL-$. 2.83/s 0% -- -51% -56% -99%
VL-tr 5.83/s 107% 106% -- -10% -99%
VL-s 6.45/s 129% 128% 11% -- -99%
VL-wc 501/s 17655% 17602% 8490% 7656% --
Rate LO-$count LO-$. LO-s LO-tr LO-wc
LO-$count 16.5/s -- -1% -50% -51% -97%
LO-$. 16.8/s 1% -- -50% -51% -97%
LO-s 33.2/s 101% 98% -- -3% -94%
LO-tr 34.1/s 106% 103% 3% -- -94%
LO-wc 583/s 3424% 3374% 1655% 1609% --
Rate SH-$count SH-$. SH-s SH-tr SH-wc
SH-$count 120/s -- -7% -31% -67% -81%
SH-$. 129/s 7% -- -26% -65% -80%
SH-s 174/s 45% 35% -- -52% -73%
SH-tr 364/s 202% 182% 109% -- -43%
SH-wc 642/s 433% 397% 269% 76% --
Rate VA-$count VA-$. VA-s VA-tr VA-wc
VA-$count 92.6/s -- -5% -36% -63% -79%
VA-$. 97.4/s 5% -- -33% -61% -78%
VA-s 146/s 57% 50% -- -42% -67%
VA-tr 252/s 172% 159% 73% -- -43%
VA-wc 439/s 374% 351% 201% 74% --
Edit: I did a revised benchmark to compare the rates for different line lengths. It clearly shows that tr/// starts out with a big advantage for short lines that rapidly disappears as the lines grow longer. As for why this happens, I can only speculate that tr/// is optimized for short strings.
Line count rate comparison chart: http://img69.imageshack.us/img69/6250/linecount.th.png