How can I set the file-read buffer size in Perl to optimize it for large files?

暗喜 2021-01-05 06:15

I understand that both Java and Perl try quite hard to find a one-size-fits-all default buffer size when reading in files, but I find their choices to be increasingly antiqu

4 Answers
  • 2021-01-05 06:45

    You can affect the buffering if you're running on an OS that supports setvbuf; see the documentation for IO::Handle.

    If you're using perl v5.10 or later then there is no need to explicitly create an IO::Handle object as described in the documentation, as all file handles are implicitly blessed into IO::Handle objects since that release.

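    use 5.010;
    use strict;
    use warnings;
    
    use autodie;
    
    # _IOLBF requests line buffering; _IOFBF (full buffering) is also exportable.
    use IO::Handle '_IOLBF';
    
    open my $handle, '<:utf8', 'foo';
    
    # $buffer backs the stream and must not be modified while in use.
    my $buffer;
    $handle->setvbuf($buffer, _IOLBF, 0x10000);
    
    while ( my $line = <$handle> ) {
        # process $line here
    }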
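
    A minimal sketch of guarding that call, assuming you want the script to keep running when setvbuf is unavailable on your build. The helper name try_setvbuf and the temporary demo file are made up for illustration; setvbuf, _IOFBF and $Config{useperlio} are the real pieces.

    ```perl
    use strict;
    use warnings;
    use Config;
    use File::Temp qw(tempfile);
    use IO::Handle qw(_IOFBF);

    # Hypothetical helper: try to attach a $size-byte stdio buffer to $fh.
    # Returns 1 on success, 0 if setvbuf throws (e.g. it is not available).
    sub try_setvbuf {
        my ($fh, $bufref, $size) = @_;
        return eval { $fh->setvbuf($$bufref, _IOFBF, $size); 1 } ? 1 : 0;
    }

    # Small temporary file so the sketch is self-contained.
    my ($tmp, $path) = tempfile(UNLINK => 1);
    print {$tmp} "line $_\n" for 1 .. 3;
    close $tmp or die "close failed: $!";

    open my $in, '<', $path or die "could not open $path: $!";
    my $buffer;    # must not be modified while the handle uses it
    my $ok = try_setvbuf($in, \$buffer, 0x10000);
    print "useperlio=", ($Config{useperlio} // 'undef'), " setvbuf ok=$ok\n";

    # Reading works the same way whether or not the buffer was installed.
    my $lines = 0;
    while ( my $line = <$in> ) {
        $lines++;
    }
    close $in;
    print "read $lines lines\n";
    ```

    If the second field prints ok=0, the handle simply keeps its default buffering.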
    
  • 2021-01-05 06:48

    I'm necroposting since this came up on this PerlMonks thread.

    It's not possible to use setvbuf on perls using PerlIO, which has been the default since version 5.8.0. However, there is the PerlIO::buffersize module on CPAN that allows you to set the buffer size when opening a file:

        open my $fh, '<:buffersize(65536)', $filename
            or die "could not open $filename: $!";
    

    IIRC, you could also set the default for any new files by using this at the beginning of your script:

        use open ':buffersize(65536)';
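
    Since PerlIO::buffersize is a CPAN module and not part of core, here is a rough sketch of falling back to the default layers when it isn't installed. The helper name open_buffered and the demo file are hypothetical; the layer syntax is the same one used above.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Hypothetical helper: open $path for reading with a $size-byte buffer if
    # PerlIO::buffersize can be loaded, otherwise with the default layers.
    sub open_buffered {
        my ($path, $size) = @_;
        my $have_layer = eval { require PerlIO::buffersize; 1 } ? 1 : 0;
        my $mode = $have_layer ? "<:buffersize($size)" : '<';
        open my $fh, $mode, $path or die "could not open $path: $!";
        return ($fh, $have_layer);
    }

    # Small temporary file so the sketch is self-contained.
    my ($tmp, $path) = tempfile(UNLINK => 1);
    print {$tmp} "row $_\n" for 1 .. 5;
    close $tmp or die "close failed: $!";

    my ($fh, $used_layer) = open_buffered($path, 65536);
    my $rows = 0;
    while ( my $line = <$fh> ) {
        $rows++;
    }
    close $fh;

    print "buffersize layer used: ", ($used_layer ? "yes" : "no"), "\n";
    print "rows read: $rows\n";
    ```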
    
  • 2021-01-05 06:49

    No, there's not (short of recompiling a modified perl), but you can read the whole file into memory, then work line by line from that:

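    use File::Slurp;
    my $buffer = read_file("filename");
    # Opening a scalar reference gives an in-memory filehandle.
    open my $in_handle, "<", \$buffer
        or die "could not open in-memory handle: $!";
    while ( my $line = readline($in_handle) ) {
        # process $line here
    }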
    

    Note that perl before 5.8 defaulted to using stdio buffers in most places (often cheating by accessing the buffers directly rather than going through the stdio library), while 5.8 and later default to perl's own PerlIO layer system. PerlIO seems to use a 4k buffer by default (as far as I know, 5.14 raised that to 8k), but writing a layer that allows configuring this should be trivial once you figure out how to write a layer: see perldoc perliol.
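
    To see which layers a handle actually sits on, the built-in PerlIO::get_layers function can be used. The sketch below also does the slurp with core Perl only (undef $/) in place of File::Slurp; the temporary file and variable names are just for illustration, and the layer names printed depend on your platform and build.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Small temporary file so the sketch is self-contained.
    my ($tmp, $path) = tempfile(UNLINK => 1);
    print {$tmp} "alpha\nbeta\n";
    close $tmp or die "close failed: $!";

    open my $fh, '<', $path or die "could not open $path: $!";

    # PerlIO::get_layers lists the layers stacked on a handle.
    my @file_layers = PerlIO::get_layers($fh);
    print "file handle layers: @file_layers\n";

    # Slurp the whole file: with $/ undefined, one read returns everything.
    my $contents = do { local $/; <$fh> };
    close $fh;

    # An in-memory handle reads from the scalar, not from the OS.
    open my $mem, '<', \$contents
        or die "could not open in-memory handle: $!";
    my @mem_layers = PerlIO::get_layers($mem);
    print "in-memory handle layers: @mem_layers\n";

    my $n = 0;
    while ( my $line = <$mem> ) {
        $n++;
    }
    close $mem;
    print "lines: $n\n";
    ```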

  • 2021-01-05 07:04

    Warning: the following code has only been lightly tested. It is a first shot at a function that will let you process a file line by line (hence the function name) with a user-definable buffer size. It takes up to four arguments:

    1. an open filehandle (default is STDIN)
    2. a buffer size (default is 4k)
    3. a reference to a variable to store the line in (default is $_)
    4. an anonymous subroutine to call on each line (the default prints the line).

    The arguments are positional with the exception that the last argument may always be the anonymous subroutine. Lines are auto-chomped.

    Probable bugs:

    • may not work on systems where line feed is not the end-of-line character
    • will likely fail when combined with a lexical $_ (introduced in Perl 5.10)

    You can see from an strace that it reads the file with the specified buffer size. If I like how testing goes, you may see this on CPAN soon.

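    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use Scalar::Util qw/reftype/;
    use Carp;
    
    sub line_by_line {
        local $_;
        # References to the four optional parameters, preset to their defaults.
        my @args = \(
            my $fh      = \*STDIN,
            my $bufsize = 4*1024,
            my $ref     = \$_,
            my $coderef = sub { print "$_\n" },
        );
        croak "bad number of arguments" if @_ > @args;
    
        for my $arg_val (@_) {
            # A coderef ends argument parsing, so it must come last.
            # reftype returns undef for non-references, hence the || ''.
            if ((reftype($arg_val) || '') eq "CODE") {
                ${$args[-1]} = $arg_val;
                last;
            }
            my $arg = shift @args;
            $$arg = $arg_val;
        }
    
        my $buf;
        my $overflow = '';
        OUTER:
        while (sysread $fh, $buf, $bufsize) {
            # Keep the "\n" separators so a partial last line can be detected.
            my @lines = split /(\n)/, $buf;
            while (@lines) {
                my $line = $overflow . shift @lines;
                # No separator follows: the line continues in the next chunk.
                unless (defined $lines[0]) {
                    $overflow = $line;
                    next OUTER;
                }
                $overflow = shift @lines;
                if ($overflow eq "\n") {
                    $overflow = "";
                } else {
                    next OUTER;
                }
                $$ref = $line;
                $coderef->();
            }
        }
        # Emit a final line that lacks a trailing newline.
        if (length $overflow) {
            $$ref = $overflow;
            $coderef->();
        }
    }
    
    # Buffer size from the command line, falling back to the 4k default.
    my $bufsize = shift || 4*1024;
    
    open my $fh, "<", $0
        or die "could not open $0: $!";
    
    # Pass the coderef last; arguments after it would be ignored.
    my $count = 0;
    line_by_line $fh, $bufsize, sub {
        $count++ if /lines/;
    };
    
    print "$count\n";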
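
    If all you need is a count and not the lines themselves, the same sysread idea gets much simpler. This is only a sketch under that assumption: count_newlines and the temporary demo file are hypothetical names, and it counts "\n" characters, so a final line without a trailing newline is not counted.

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Hypothetical helper: count "\n" characters in $path, asking the OS for
    # up to $bufsize bytes per sysread call.
    sub count_newlines {
        my ($path, $bufsize) = @_;
        open my $fh, '<:raw', $path or die "could not open $path: $!";
        my ($count, $buf) = (0, '');
        while (1) {
            my $got = sysread $fh, $buf, $bufsize;
            die "read error on $path: $!" unless defined $got;
            last if $got == 0;              # end of file
            $count += ($buf =~ tr/\n//);    # tr with no replacement only counts
        }
        close $fh;
        return $count;
    }

    # Small temporary file so the sketch is self-contained.
    my ($tmp, $path) = tempfile(UNLINK => 1);
    print {$tmp} "row $_\n" for 1 .. 1000;
    close $tmp or die "close failed: $!";

    my $big   = count_newlines($path, 64 * 1024);
    my $small = count_newlines($path, 7);    # odd size to exercise chunk boundaries
    print "newlines with 64k buffer: $big\n";
    print "newlines with 7-byte buffer: $small\n";
    ```

    The result does not depend on the buffer size; only the number of read calls does.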
    