I understand that both Java and Perl try quite hard to find a one-size-fits-all default buffer size when reading in files, but I find their choices to be increasingly antiqu
You can affect the buffering if you're running on an OS that supports setvbuf; see the documentation for IO::Handle.
If you're using perl v5.10 or later then there is no need to explicitly create an IO::Handle object as described in the documentation, as all file handles are implicitly blessed into IO::Handle objects since that release.
use 5.010;
use strict;
use warnings;
use autodie;
use IO::Handle '_IOLBF';
open my $handle, '<:utf8', 'foo';
my $buffer;
$handle->setvbuf($buffer, _IOLBF, 0x10000);
while ( my $line = <$handle> ) {
    ...
}
I'm necroposting since this came up on this PerlMonks thread.
It's not possible to use setvbuf on perls using PerlIO, which has been the default since version 5.8.0. However, there is the PerlIO::buffersize module on CPAN that allows you to set the buffer size when opening a file:
open my $fh, '<:buffersize(65536)', $filename;
IIRC, you could also set the default for any new files by using this at the beginning of your script:
use open ':buffersize(65536)';
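Since setvbuf is unavailable on PerlIO builds, it can help to check how a given perl was built before choosing an approach. The following is only a sketch using core modules; the helper names `built_with_perlio` and `layers_of` are made up for illustration, and it assumes `$Config{useperlio}` reports `define` on PerlIO builds.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: 1 if this perl was built with the PerlIO subsystem, else 0.
# Assumes $Config{useperlio} is 'define' on PerlIO builds.
sub built_with_perlio {
    return ( defined $Config{useperlio} && $Config{useperlio} eq 'define' ) ? 1 : 0;
}

# Hypothetical helper: names of the PerlIO layers stacked on a handle.
sub layers_of {
    my ($fh) = @_;
    return PerlIO::get_layers($fh);
}

printf "PerlIO build: %s\n", built_with_perlio() ? "yes" : "no";
printf "STDOUT layers: %s\n", join ' ', layers_of( \*STDOUT );
```

If the first line prints "yes", setvbuf will probably not work and a layer such as the one above is the more likely route.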
No, there's not (short of recompiling a modified perl), but you can read the whole file into memory, then work line by line from that:
use File::Slurp;
my $buffer = read_file("filename");
open my $in_handle, "<", \$buffer;
while ( my $line = readline($in_handle) ) {
    # process $line here
}
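If File::Slurp is not installed, the same idea can be sketched with core Perl only. This is a minimal sketch rather than a drop-in replacement: the helper name `count_lines_from_memory` and the sample data are made up for illustration, and it assumes the whole file fits comfortably in memory.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: slurp a file in one go, then read lines from the
# in-memory copy. Returns the number of lines seen.
sub count_lines_from_memory {
    my ($path) = @_;
    open my $fh, '<', $path or die "could not open $path: $!";
    my $buffer = do { local $/; <$fh> };    # undefined $/ reads the whole file
    close $fh;
    $buffer = '' unless defined $buffer;

    # Open a read handle on the scalar itself instead of on a file.
    open my $in_handle, '<', \$buffer
        or die "could not open in-memory handle: $!";
    my $count = 0;
    while ( defined( my $line = readline($in_handle) ) ) {
        $count++;
    }
    close $in_handle;
    return $count;
}

# Demo on a small temporary file (sample data is illustrative only).
my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
print {$tmp_fh} "first\nsecond\nthird\n";
close $tmp_fh;
print count_lines_from_memory($tmp_path), "\n";
```

Using `defined(...)` in the loop condition avoids stopping early on a final line that is just "0" with no trailing newline.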
Note that perl before 5.8 defaulted to using stdio buffers in most places (but often cheating and accessing the buffers directly, not through the stdio library), but 5.8 and later default to perl's own perlio layer system. The latter seems to use a 4k buffer by default, but writing a layer that allows configuring this should be trivial (once you figure out how to write a layer: see perldoc perliol).
Warning: the following code has only been lightly tested. It is a first shot at a function that will let you process a file line by line (hence the function name) with a user-definable buffer size. It takes up to four arguments: a filehandle (default STDIN), a buffer size (default 4 KiB), a reference to the variable that receives each line (default $_), and an anonymous subroutine to call for each line (the default prints the line).
The arguments are positional with the exception that the last argument may always be the anonymous subroutine. Lines are auto-chomped.
Probable bugs: it is likely to have problems with a lexical $_ (introduced in Perl 5.10).
You can see from an strace that it reads the file with the specified buffer size. If I like how testing goes, you may see this on CPAN soon.
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw/reftype/;
use Carp;

sub line_by_line {
    local $_;
    # Defaults in positional order; @args holds references to these variables.
    my @args = \(
        my $fh      = \*STDIN,
        my $bufsize = 4*1024,
        my $ref     = \$_,
        my $coderef = sub { print "$_\n" },
    );
    croak "bad number of arguments" if @_ > @args;
    for my $arg_val (@_) {
        # reftype returns undef for non-references, so guard the comparison
        if ((reftype($arg_val) || '') eq "CODE") {
            ${$args[-1]} = $arg_val;
            last;    # the code reference must be the final argument
        }
        my $arg = shift @args;
        $$arg = $arg_val;
    }
    my $buf;
    my $overflow = '';
    OUTER:
    while (sysread $fh, $buf, $bufsize) {
        # The capture keeps the separators: text, "\n", text, "\n", ...
        my @lines = split /(\n)/, $buf;
        while (@lines) {
            my $line = $overflow . shift @lines;
            unless (defined $lines[0]) {
                # No newline yet: carry the partial line into the next read.
                $overflow = $line;
                next OUTER;
            }
            shift @lines;    # discard the "\n" separator
            $overflow = '';
            $$ref = $line;
            $coderef->();
        }
    }
    # Emit a final line that has no trailing newline.
    if (length $overflow) {
        $$ref = $overflow;
        $coderef->();
    }
}

my $bufsize = shift || 4*1024;    # buffer size from the command line
open my $fh, "<", $0
    or die "could not open $0: $!";
my $count = 0;
# The code reference goes last; arguments after it would be ignored.
line_by_line($fh, $bufsize, sub {
    $count++ if /lines/;
});
print "$count\n";
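The strace observation can be approximated without strace by counting how many sysread calls it takes to consume a file at a given buffer size. This is only a rough sketch: the helper name `count_reads` and the 10000-byte sample file are made up for illustration, and it assumes a regular local file, where short reads normally happen only at end of file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file with sysread in chunks of $bufsize bytes
# and return (number of non-empty reads, total bytes read).
sub count_reads {
    my ( $path, $bufsize ) = @_;
    open my $fh, '<', $path or die "could not open $path: $!";
    binmode $fh;
    my ( $reads, $bytes ) = ( 0, 0 );
    while (1) {
        my $n = sysread $fh, my $buf, $bufsize;
        die "read error on $path: $!" unless defined $n;
        last if $n == 0;    # end of file
        $reads++;
        $bytes += $n;
    }
    close $fh;
    return ( $reads, $bytes );
}

# Demo: a 10000-byte temporary file read with two buffer sizes.
my ( $tmp_fh, $tmp_path ) = tempfile( UNLINK => 1 );
binmode $tmp_fh;
print {$tmp_fh} 'x' x 10_000;
close $tmp_fh;

for my $size ( 4 * 1024, 64 * 1024 ) {
    my ( $reads, $bytes ) = count_reads( $tmp_path, $size );
    print "bufsize $size: $reads reads, $bytes bytes\n";
}
```

A larger buffer should need fewer reads for the same file, which is the effect the strace output shows at the system-call level.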