Why does Perl's tr/\n// get slower and slower as line lengths increase?

前端 未结 3 584
遥遥无期
遥遥无期 2021-02-07 20:19

In perlfaq5, there\'s an answer for How do I count the number of lines in a file?. The current answer suggests a sysread and a tr/\\n//. I wanted to tr

相关标签:
3条回答
  • 2021-02-07 20:33

    I wondered whether the benchmarks we've been using have too many moving parts: we are crunching data files of different sizes, using different line lengths, and trying to gauge the speed of tr relative to its competitors -- with an underlying (but untested) assumptions that tr is the method whose performance is varying with line length.

    Also, as brian has pointed out in a few comments, we are feeding tr buffers of data that are always the same size (4096 bytes). If any of the methods should be insensitive to line size, it should be tr.

    And then it struck me: what if tr were the stable reference point and the other methods were the ones varying with line size? When you look out your spaceship window, is it you or that Klingon bird-of-prey that's moving?

    So I developed a benchmark that held the size of the data files constant: line length varies, but the total number of bytes stays the same. As the results show:

    • tr is the approach least sensitive to variation in line length. Since the total N of bytes processed is constant for all three line lengths tested (short, medium, long), this means that tr is quite efficient at editing the string it is given. Even though the short-line data file requires many more edits, the tr approach is able to crunch the data file almost as fast as it handles the long-line file.
    • The methods that rely on <> speed up as the lines become longer, although at a diminishing rate. This makes sense: since each call to <> requires some work, it should be slower to process a given N of bytes using shorter lines (at least over the range tested).
    • The s/// approach is also sensitive to line length. Like tr, this approach works by editing the string it is given. Again, shorter line length means more edits. Apparently, the ability of s/// to make such edits is much less efficient than that of tr.

    Here are the results on Solaris with Perl 5.8.8:

    #   ln = $.      <>, then check $.
    #   nn = $n      <>, counting lines
    #   tr = tr///   using sysread
    #   ss = s///    using sysread
    
    #   S = short lines  (50)
    #   M = medium lines (500)
    #   L = long lines   (5000)
    
           Rate nn-S
    nn-S 1.66/s   --
    ln-S 1.81/s   9%
    ss-S 2.45/s  48%
    nn-M 4.02/s 142%
    ln-M 4.07/s 145%
    ln-L 4.65/s 180%
    nn-L 4.65/s 180%
    ss-M 5.85/s 252%
    ss-L 7.04/s 324%
    tr-S 7.30/s 339%    # tr
    tr-L 7.63/s 360%    # tr
    tr-M 7.69/s 363%    # tr
    

    The results on Windows ActiveState's Perl 5.10.0 were roughly comparable.

    Finally, the code:

    use strict;
    use warnings;
    use Set::CrossProduct;
    use Benchmark qw(cmpthese);
    
    # Args: file size (in million bytes)
    #       N of benchmark iterations
    #       true/false (whether to regenerate files)
    #
    # My results were run with 50 10 1
    main(@ARGV);
    
    sub main {
        my ($file_size, $benchmark_n, $regenerate) = @_;
        $file_size *= 1000000;
        my @file_names = create_files($file_size, $regenerate);
        my %methods = (
            ln => \&method_ln,  # $.
            nn => \&method_nn,  # $n
            tr => \&method_tr,  # tr///
            ss => \&method_ss,  # s///
        );
        my $combo_iter = Set::CrossProduct->new([ [keys %methods], \@file_names ]);
        open my $log_fh, '>', 'log.txt';
        my %benchmark_args = map {
            my ($m, $f) = @$_;
            "$m-$f" => sub { $methods{$m}->($f, $log_fh) }
        } $combo_iter->combinations;
        cmpthese($benchmark_n, \%benchmark_args);
        close $log_fh;
    }
    
    sub create_files {
        my ($file_size, $regenerate) = @_;
        my %line_lengths = (
            S =>    50,
            M =>   500,
            L =>  5000,
        );
        for my $f (keys %line_lengths){
            next if -f $f and not $regenerate;
            create_file($f, $line_lengths{$f}, $file_size);
        }
        return keys %line_lengths;
    }
    
    sub create_file {
        my ($file_name, $line_length, $file_size) = @_;
        my $n_lines = int($file_size / $line_length);
        warn "Generating $file_name with $n_lines lines\n";
        my $line = 'a' x ($line_length - 1);
        chop $line if $^O eq 'MSWin32';
        open(my $fh, '>', $file_name) or die $!;
        print $fh $line, "\n" for 1 .. $n_lines;
        close $fh;
    }
    
    sub method_nn {
        my ($data_file, $log_fh) = @_;
        open my $data_fh, '<', $data_file;
        my $n = 0;
        $n ++ while <$data_fh>;
        print $log_fh "$data_file \$n $n\n";
        close $data_fh;
    }
    
    sub method_ln {
        my ($data_file, $log_fh) = @_;
        open my $data_fh, '<', $data_file;
        1 while <$data_fh>;
        print $log_fh "$data_file \$. $.\n";
        close $data_fh;
    }
    
    sub method_tr {
        my ($data_file, $log_fh) = @_;
        open my $data_fh, '<', $data_file;
        my $n = 0;
        my $buffer;
        while (sysread $data_fh, $buffer, 4096) {
            $n += ($buffer =~ tr/\n//);
        }
        print $log_fh "$data_file tr $n\n";
        close $data_fh;
    }
    
    sub method_ss {
        my ($data_file, $log_fh) = @_;
        open my $data_fh, '<', $data_file;
        my $n = 0;
        my $buffer;
        while (sysread $data_fh, $buffer, 4096) {
            $n += ($buffer =~ s/\n//g);
        }
        print $log_fh "$data_file s/ $n\n";
        close $data_fh;
    }
    

    Update in response to Brad's comment. I tried all three variants, and they behaved roughly like s/\n//g -- slower for the data files with shorter lines (with the additional qualification that s/(\n)/$1/ was even slower than the others). The interesting part was that m/\n/g was basically the same speed as s/\n//g, suggesting that the slowness of the regex approach (both s/// and m//) does not hinge directly on the matter of editing the string.

    0 讨论(0)
  • 2021-02-07 20:38

    Long lines are about 65 times larger than short lines, and your numbers indicate that tr/\n// runs exactly 65 times slower. This is as expected.

    wc evidently scales better for long lines. I don't really know why; perhaps because it is tuned to just count newlines, especially when you use the -l option.

    0 讨论(0)
  • 2021-02-07 20:41

    I'm also seeing tr/// get relatively slower as the line lengths increase although the effect isn't as dramatic. These results are from ActivePerl 5.10.1 (32-bit) on Windows 7 x64. I also got "too few iterations for a reliable count" warnings at 100 so I bumped the iterations up to 500.

            VL: 4501.06    288
            LO: 749.25     29
            SH: 69.38      6
            VA: 104.66     55
                Rate VL-$count     VL-$.     VL-tr      VL-s     VL-wc
    VL-$count 2.82/s        --       -0%      -52%      -56%      -99%
    VL-$.     2.83/s        0%        --      -51%      -56%      -99%
    VL-tr     5.83/s      107%      106%        --      -10%      -99%
    VL-s      6.45/s      129%      128%       11%        --      -99%
    VL-wc      501/s    17655%    17602%     8490%     7656%        --
                Rate LO-$count     LO-$.      LO-s     LO-tr     LO-wc
    LO-$count 16.5/s        --       -1%      -50%      -51%      -97%
    LO-$.     16.8/s        1%        --      -50%      -51%      -97%
    LO-s      33.2/s      101%       98%        --       -3%      -94%
    LO-tr     34.1/s      106%      103%        3%        --      -94%
    LO-wc      583/s     3424%     3374%     1655%     1609%        --
                Rate SH-$count     SH-$.      SH-s     SH-tr     SH-wc
    SH-$count  120/s        --       -7%      -31%      -67%      -81%
    SH-$.      129/s        7%        --      -26%      -65%      -80%
    SH-s       174/s       45%       35%        --      -52%      -73%
    SH-tr      364/s      202%      182%      109%        --      -43%
    SH-wc      642/s      433%      397%      269%       76%        --
                Rate VA-$count     VA-$.      VA-s     VA-tr     VA-wc
    VA-$count 92.6/s        --       -5%      -36%      -63%      -79%
    VA-$.     97.4/s        5%        --      -33%      -61%      -78%
    VA-s       146/s       57%       50%        --      -42%      -67%
    VA-tr      252/s      172%      159%       73%        --      -43%
    VA-wc      439/s      374%      351%      201%       74%        --
    

    Edit: I did a revised benchmark to compare the rates for different line lengths. It clearly shows that tr/// starts out with a big advantage for short lines that rapidly disappears as the lines grow longer. As for why this happens, I can only speculate that tr/// is optimized for short strings.

    Line count rate comparison http://img69.imageshack.us/img69/6250/linecount.th.png

    0 讨论(0)
提交回复
热议问题