In perlfaq5, there's an answer for "How do I count the number of lines in a file?". The current answer suggests a sysread and a tr/\n//. I wanted to try a few other approaches and see how they compare.
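For reference, the perlfaq5 idiom looks roughly like this (a paraphrase of the FAQ entry, not a verbatim quote):

use strict;
use warnings;

# Paraphrase of the perlfaq5 approach: read fixed-size chunks with
# sysread and let tr/// count the newlines in each chunk.
my $file = shift @ARGV or die "Usage: $0 FILE\n";
open my $fh, '<', $file or die "Can't open $file: $!";
my $count = 0;
my $buffer;
while (sysread $fh, $buffer, 4096) {
    $count += ($buffer =~ tr/\n//);
}
close $fh;
print "$count\n";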
I wondered whether the benchmarks we've been using have too many moving parts: we are crunching data files of different sizes, using different line lengths, and trying to gauge the speed of tr relative to its competitors -- with an underlying (but untested) assumption that tr is the method whose performance varies with line length.
Also, as brian has pointed out in a few comments, we are feeding tr buffers of data that are always the same size (4096 bytes). If any of the methods should be insensitive to line size, it should be tr.
And then it struck me: what if tr were the stable reference point and the other methods were the ones varying with line size? When you look out your spaceship window, is it you or that Klingon bird-of-prey that's moving?
So I developed a benchmark that held the size of the data files constant: line length varies, but the total number of bytes stays the same. (For example, at 50 million bytes per file, the short-line file holds 1,000,000 lines of 50 bytes, while the long-line file holds 10,000 lines of 5000 bytes.) As the results show:
tr is the approach least sensitive to variation in line length. Since the total N of bytes processed is constant for all three line lengths tested (short, medium, long), this means that tr is quite efficient at editing the string it is given. Even though the short-line data file requires many more edits, the tr approach is able to crunch the data file almost as fast as it handles the long-line file.

The two methods that rely on <> speed up as the lines become longer, although at a diminishing rate. This makes sense: since each call to <> requires some work, it should be slower to process a given N of bytes using shorter lines (at least over the range tested).

The s/// approach is also sensitive to line length. Like tr, this approach works by editing the string it is given. Again, shorter line length means more edits. Apparently, the ability of s/// to make such edits is much less efficient than that of tr.

Here are the results on Solaris with Perl 5.8.8:
# ln = read with <>, then check $.
# nn = read with <>, counting lines in $n
# tr = tr/// using sysread
# ss = s/// using sysread
# S  = short lines (50 bytes)
# M  = medium lines (500 bytes)
# L  = long lines (5000 bytes)
        Rate  nn-S
nn-S  1.66/s    --
ln-S  1.81/s    9%
ss-S  2.45/s   48%
nn-M  4.02/s  142%
ln-M  4.07/s  145%
ln-L  4.65/s  180%
nn-L  4.65/s  180%
ss-M  5.85/s  252%
ss-L  7.04/s  324%
tr-S  7.30/s  339%   # tr
tr-L  7.63/s  360%   # tr
tr-M  7.69/s  363%   # tr
The results with ActiveState Perl 5.10.0 on Windows were roughly comparable.
Finally, the code:
use strict;
use warnings;
use Set::CrossProduct;
use Benchmark qw(cmpthese);

# Args: file size (in millions of bytes)
#       N of benchmark iterations
#       true/false (whether to regenerate the data files)
#
# My results were run with: 50 10 1

main(@ARGV);

sub main {
    my ($file_size, $benchmark_n, $regenerate) = @_;
    $file_size *= 1000000;
    my @file_names = create_files($file_size, $regenerate);
    my %methods = (
        ln => \&method_ln, # $.
        nn => \&method_nn, # $n
        tr => \&method_tr, # tr///
        ss => \&method_ss, # s///
    );
    my $combo_iter = Set::CrossProduct->new([ [ keys %methods ], \@file_names ]);
    open my $log_fh, '>', 'log.txt' or die $!;
    my %benchmark_args = map {
        my ($m, $f) = @$_;
        "$m-$f" => sub { $methods{$m}->($f, $log_fh) }
    } $combo_iter->combinations;
    cmpthese($benchmark_n, \%benchmark_args);
    close $log_fh;
}

sub create_files {
    my ($file_size, $regenerate) = @_;
    my %line_lengths = (
        S => 50,
        M => 500,
        L => 5000,
    );
    for my $f (keys %line_lengths){
        next if -f $f and not $regenerate;
        create_file($f, $line_lengths{$f}, $file_size);
    }
    return keys %line_lengths;
}

sub create_file {
    my ($file_name, $line_length, $file_size) = @_;
    my $n_lines = int($file_size / $line_length);
    warn "Generating $file_name with $n_lines lines\n";
    my $line = 'a' x ($line_length - 1);
    # On Windows, "\n" is written to disk as CRLF, so drop one more
    # character to keep the on-disk line length constant.
    chop $line if $^O eq 'MSWin32';
    open(my $fh, '>', $file_name) or die $!;
    print $fh $line, "\n" for 1 .. $n_lines;
    close $fh;
}

sub method_nn {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file or die $!;
    my $n = 0;
    $n ++ while <$data_fh>;
    print $log_fh "$data_file \$n $n\n";
    close $data_fh;
}

sub method_ln {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file or die $!;
    1 while <$data_fh>;
    print $log_fh "$data_file \$. $.\n";
    close $data_fh;
}

sub method_tr {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file or die $!;
    my $n = 0;
    my $buffer;
    while (sysread $data_fh, $buffer, 4096) {
        $n += ($buffer =~ tr/\n//);
    }
    print $log_fh "$data_file tr $n\n";
    close $data_fh;
}

sub method_ss {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file or die $!;
    my $n = 0;
    my $buffer;
    while (sysread $data_fh, $buffer, 4096) {
        $n += ($buffer =~ s/\n//g);
    }
    print $log_fh "$data_file s/ $n\n";
    close $data_fh;
}
Update in response to Brad's comment: I tried all three variants, and they behaved roughly like s/\n//g -- slower for the data files with shorter lines (with the additional qualification that s/(\n)/$1/ was even slower than the others). The interesting part was that m/\n/g was basically the same speed as s/\n//g, suggesting that the slowness of the regex approach (both s/// and m//) does not hinge directly on the matter of editing the string.
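For concreteness, here is a sketch of the m//g variant, written to mirror method_ss above; the sub name method_mm and the exact loop body are my assumed reconstruction, not code from the comment:

sub method_mm {
    my ($data_file, $log_fh) = @_;
    open my $data_fh, '<', $data_file or die $!;
    my $n = 0;
    my $buffer;
    while (sysread $data_fh, $buffer, 4096) {
        # Count matches without editing the buffer: the empty-list
        # assignment forces list context on the match, and the outer
        # assignment in scalar context yields the number of matches.
        $n += () = $buffer =~ m/\n/g;
    }
    print $log_fh "$data_file m/ $n\n";
    close $data_fh;
}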