What is the best way to gunzip files with Perl?

面向向阳花 2021-01-16 07:12

Is there a faster way to gunzip files in Perl than my current 'zcat' solution?

A little benchmark:

#!/usr/bin/perl

use strict;
use warnings;         


        
4 answers
  • 2021-01-16 07:31

    On typical desktop hardware, zcat is almost certain to be I/O-bound on non-trivial data. (Your sample files are small enough that they will certainly be served from the buffer cache.) In that case no code-level optimization will help you, and spawning an external gzip looks like the right choice to me.
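
    A minimal sketch of spawning an external gzip, assuming a `gzip` binary is on the PATH. The helper name `grep_gz_external` is made up for illustration and is not from the question. It uses the list form of the pipe open, so the file name never passes through a shell:

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical helper: stream a gzip file through an external
    # `gzip -dc` process and return the lines that match $re.
    sub grep_gz_external {
        my ($path, $re) = @_;

        # List-form pipe open: the arguments go straight to gzip, so
        # spaces or shell metacharacters in $path are not interpreted.
        open(my $fh, '-|', 'gzip', '-dc', $path)
            or die "cannot start gzip for $path: $!";

        my @hits;
        while (defined(my $line = <$fh>)) {
            push @hits, $line if $line =~ $re;
        }

        # close() on a pipe waits for the child; a non-zero exit
        # status from gzip makes close() return false.
        close($fh)
            or die "gzip failed on $path (exit status " . ($? >> 8) . ")";

        return @hits;
    }

    # Usage: ./script.pl file.gz
    print grep_gz_external($ARGV[0], qr/test/) if @ARGV;
    ```

    Unlike the string form `"zcat " . $ARGV[0]`, this reports a failed decompression instead of silently printing nothing.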

  • 2021-01-16 07:38

    I updated my benchmark with PerlIO::gzip as runrig suggested.

    My updated benchmark:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use Benchmark qw(cmpthese timethese);
    use IO::Uncompress::Gunzip qw(gunzip);
    use PerlIO::gzip;
    
    my $re = qr/test/;
    
    my $bench = timethese($ARGV[1], {
    
      zcat => sub {
        if (defined open(my $FILE, "-|", "zcat " . $ARGV[0]))
        {
          while (<$FILE>)
          {
            print $_  if ($_ =~ $re);
          }
          close($FILE);
        }
      },
    
      io_gunzip => sub {
        my $z = IO::Uncompress::Gunzip->new($ARGV[0]);
        while (<$z>)
        {
          print $_  if ($_ =~ $re);
        }
      },
    
      io_gunzip_getline => sub {
        my $z = IO::Uncompress::Gunzip->new($ARGV[0]);
        while (my $line = $z->getline())
        {
          print $line if ($line =~ $re);
        }
      },
    
      perlio_gzip => sub {
        if (defined open(my $FILE, "<:gzip", $ARGV[0]))
        {
          while (<$FILE>)
          {
            print $_  if ($_ =~ $re);
          }
          close($FILE);
        }
      },
    
    } );
    
    cmpthese $bench;
    
    1;
    

    New results:

    # zcat test.gz| wc -l
    566
    # zcat test2.gz| wc -l
    60459
    # zcat test3.gz| wc -l
    604590
    # ./zip_test.pl test.gz 1000
    Benchmark: timing 1000 iterations of io_gunzip, io_gunzip_getline, perlio_gzip, zcat...
     io_gunzip:  6 wallclock secs ( 6.07 usr +  0.03 sys =  6.10 CPU) @ 163.93/s (n=1000)
    io_gunzip_getline:  6 wallclock secs ( 5.23 usr +  0.02 sys =  5.25 CPU) @ 190.48/s (n=1000)
    perlio_gzip:  0 wallclock secs ( 0.62 usr +  0.01 sys =  0.63 CPU) @ 1587.30/s (n=1000)
          zcat:  6 wallclock secs ( 0.37 usr  0.98 sys +  0.94 cusr  2.86 csys =  5.15 CPU) @ 194.17/s (n=1000)
                        Rate    io_gunzip io_gunzip_getline         zcat perlio_gzip
    io_gunzip          164/s           --              -14%         -16%        -90%
    io_gunzip_getline  190/s          16%                --          -2%        -88%
    zcat               194/s          18%                2%           --        -88%
    perlio_gzip       1587/s         868%              733%         717%          --
    # ./zip_test.pl test2.gz 50
    Benchmark: timing 50 iterations of io_gunzip, io_gunzip_getline, perlio_gzip, zcat...
     io_gunzip: 30 wallclock secs (29.50 usr +  0.11 sys = 29.61 CPU) @  1.69/s (n=50)
    io_gunzip_getline: 25 wallclock secs (24.85 usr +  0.10 sys = 24.95 CPU) @  2.00/s (n=50)
    perlio_gzip:  4 wallclock secs ( 3.22 usr +  0.01 sys =  3.23 CPU) @ 15.48/s (n=50)
          zcat:  4 wallclock secs ( 2.35 usr  0.23 sys +  1.29 cusr  0.28 csys =  4.15 CPU) @ 12.05/s (n=50)
                        Rate    io_gunzip io_gunzip_getline         zcat perlio_gzip
    io_gunzip         1.69/s           --              -16%         -86%        -89%
    io_gunzip_getline 2.00/s          19%                --         -83%        -87%
    zcat              12.0/s         613%              501%           --        -22%
    perlio_gzip       15.5/s         817%              672%          28%          --
    # ./zip_test.pl test3.gz 50
    Benchmark: timing 50 iterations of io_gunzip, io_gunzip_getline, perlio_gzip, zcat...
     io_gunzip: 303 wallclock secs (299.28 usr +  1.30 sys = 300.58 CPU) @  0.17/s (n=50)
    io_gunzip_getline: 250 wallclock secs (248.26 usr +  0.79 sys = 249.05 CPU) @  0.20/s (n=50)
    perlio_gzip: 32 wallclock secs (32.03 usr +  0.20 sys = 32.23 CPU) @  1.55/s (n=50)
          zcat: 44 wallclock secs (24.64 usr  1.83 sys + 11.93 cusr  1.62 csys = 40.02 CPU) @  1.25/s (n=50)
                      s/iter    io_gunzip io_gunzip_getline         zcat perlio_gzip
    io_gunzip           6.01           --              -17%         -87%        -89%
    io_gunzip_getline   4.98          21%                --         -84%        -87%
    zcat               0.800         651%              522%           --        -19%
    perlio_gzip        0.645         833%              673%          24%          --
    

    PerlIO::gzip is the fastest solution: about 8 times faster than zcat on the smallest file, and 24% to 28% faster on the two larger ones.

  • 2021-01-16 07:54

    The last time I tried it, spawning an external gunzip was considerably faster than using a Perl module (just like your benchmarks show). I suspect it's all the method calls involved in tying a filehandle.

    I expect <$z> is slower than $z->getline for a similar reason. There's more magic involved in figuring out that the first needs to be translated into the second.
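
    To check this without any files on disk, here is a small sketch that compares the two read styles on a gzip stream built in memory. The sample data and the sub names `count_readline` and `count_getline` are made up for illustration; the timings will depend on your Perl and module versions, so none are shown.

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use IO::Compress::Gzip qw(gzip $GzipError);
    use IO::Uncompress::Gunzip qw($GunzipError);

    # Build a gzip stream in memory (made-up sample data).
    my $plain = join '', map { "line $_ test\n" } 1 .. 2000;
    my $gz;
    gzip(\$plain => \$gz) or die "gzip failed: $GzipError";

    # Read with the readline operator, which goes through the handle interface.
    sub count_readline {
        my $z = IO::Uncompress::Gunzip->new(\$gz)
            or die "gunzip failed: $GunzipError";
        my $n = 0;
        while (defined(my $line = <$z>)) { $n++ }
        return $n;
    }

    # Read by calling the getline method directly on the object.
    sub count_getline {
        my $z = IO::Uncompress::Gunzip->new(\$gz)
            or die "gunzip failed: $GunzipError";
        my $n = 0;
        while (defined(my $line = $z->getline)) { $n++ }
        return $n;
    }

    # Run each variant for at least one CPU second and print a comparison table.
    cmpthese(-1, {
        readline_op => \&count_readline,
        getline     => \&count_getline,
    });
    ```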

  • 2021-01-16 07:55

    "And I also don't understand why while (<$z>) is slower than while (my $line = $z->getline())..."

    Because $z is a self-tied object. Tied objects are notoriously slow, and <$z> goes through the tied-handle interface to call getline() rather than calling the method directly.

    You can also try PerlIO::gzip, but I suspect it won't be much faster, if at all, than the other module.
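
    The tied-handle dispatch can be seen with a toy class. This is only an illustration of the mechanism, not the actual IO::Uncompress::Gunzip internals; `CountingHandle` and its fields are hypothetical names. Reading with `<FH>` makes Perl call READLINE on the tied object, which here forwards to getline:

    ```perl
    use strict;
    use warnings;

    # Hypothetical toy class: a tied handle that serves lines from an
    # array and counts how often getline is reached.
    package CountingHandle;

    sub TIEHANDLE {
        my ($class, @lines) = @_;
        return bless { lines => [@lines], calls => 0 }, $class;
    }

    # The real work: hand out the next line, or undef when none are left.
    sub getline {
        my ($self) = @_;
        $self->{calls}++;
        return shift @{ $self->{lines} };
    }

    # Perl calls READLINE for <FH> on a tied handle; it only forwards.
    sub READLINE { return $_[0]->getline }

    package main;

    tie *FH, 'CountingHandle', "first\n", "second\n";
    my $obj = tied *FH;    # the object behind the tied handle

    my @seen;
    while (defined(my $line = <FH>)) {
        push @seen, $line;
    }

    printf "lines read: %d, getline calls: %d\n", scalar @seen, $obj->{calls};
    # prints "lines read: 2, getline calls: 3" (the third call returns undef)
    ```

    Every `<FH>` read pays for the tie lookup plus the extra READLINE call before reaching getline, while `$obj->getline` is a single method call.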
