How can I get exactly n random lines from a file with Perl?

Submitted by 混江龙づ霸主 on 2019-12-04 06:32:04

Here's a nice one-pass algorithm that I just came up with, having O(N) time complexity and O(M) space complexity, for choosing M random lines from an N-line file.

Assume M <= N.

  1. Let S be the set of chosen lines. Initialize S to the first M lines of the file. If the ordering of the final result is important, shuffle S now.
  2. Read the next line, l. At this point we have read n total lines (n = M + 1 on the first iteration). The probability that we want to choose l as one of our final lines is therefore M/n.
  3. Accept l with probability M/n, using an RNG to decide whether to accept or reject it.
  4. If l has been accepted, randomly choose one of the lines in S and replace it with l.
  5. Repeat steps 2-4 until the file has been exhausted of lines, incrementing n with each new line read.
  6. Return the set S of chosen lines.
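The steps above can be sketched directly in Perl. This is only a sketch, not anyone's canonical implementation; the sub name sample_lines and the decision to shuffle on output (rather than in step 1) are my own choices.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# sample_lines($m, @lines): one-pass reservoir sample of $m lines
sub sample_lines {
    my ($m, @lines) = @_;
    my @s;        # the set S of chosen lines
    my $n = 0;    # total lines seen so far

    for my $line (@lines) {
        $n++;
        if (@s < $m) {
            push @s, $line;          # step 1: keep the first M lines
        }
        elsif (rand($n) < $m) {      # step 3: accept with probability M/n
            $s[rand @s] = $line;     # step 4: replace a random kept line
        }
    }
    return shuffle @s;               # shuffle if output ordering matters
}

print sample_lines(3, map "line $_\n", 1 .. 100);
```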

This takes a single command-line argument: the number of lines you want, M. The first M lines are kept, as you might not see any more. Thereafter, you randomly decide whether to take the next line, and if you do, you randomly decide which line in the current list-of-M to overwrite.

#!/usr/bin/perl
my $bufsize = shift;
my @list = ();

srand();
while (<>)
{
    # keep the first M lines unconditionally
    push(@list, $_), next if (@list < $bufsize);
    # keep line number $. with probability M/$.
    $list[ rand(@list) ] = $_ if (rand($. / $bufsize) < 1);
}
print foreach @list;

Possible solution:

  1. scan once to count the number of lines
  2. randomly decide which line numbers to pick
  3. scan again, picking out those lines
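A sketch of that two-pass idea (the one-pass code that follows takes a different, reservoir-style approach). The sub name pick_lines and the guard for $n exceeding the line count are my additions:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# pick_lines($file, $n): two passes over a seekable file
sub pick_lines {
    my ($file, $n) = @_;

    # pass 1: count the lines
    open my $fh, '<', $file or die "Can't open $file: $!\n";
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;

    # decide which line numbers to pick (without replacement)
    $n = $lines if $n > $lines;
    my %want = map { $_ => 1 } (shuffle 1 .. $lines)[0 .. $n - 1];

    # pass 2: scan again, collecting the chosen lines by number
    open $fh, '<', $file or die "Can't reopen $file: $!\n";
    my @picked;
    while (<$fh>) {
        push @picked, $_ if $want{$.};
    }
    close $fh;
    return @picked;
}

# usage, e.g.: print pick_lines('big_text_file', 10);
```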
my $n = shift || 10;   # number of lines to keep
my @result = ();

my $k = 0;
while (<>) {
    $k++;
    if (@result < $n) {
        push @result, $_;              # keep the first $n lines
    } elsif (rand() < $n / $k) {       # keep line $k with probability n/k
        $result[int rand $n] = $_;
    }
}

print for @result;

There's no need to know the actual line number in the file. Simply seek to a random place and keep the next line. (The current line will most likely be a partial one.) Note that this samples with replacement and favors lines that follow long lines, so it is only approximately uniform.

This approach should be very fast for large files, but it will not work for STDIN. Heck, nothing short of caching the entire file in memory will work for STDIN. So, if you must have STDIN, I don't see how you can be fast/cheap for large files.

You could detect STDIN and switch to a cached approach, otherwise be fast.
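One way to do that detection, as a sketch: a handle attached to a pipe or terminal is not seekable, so probe it with a zero-byte relative seek and choose the strategy accordingly. The sub name input_is_seekable and the two commented-out strategy calls are hypothetical, not from any answer here:

```perl
use strict;
use warnings;

# probe a handle: seek succeeds on plain files, fails on pipes/terminals
sub input_is_seekable {
    my ($fh) = @_;
    return seek($fh, 0, 1);   # SEEK_CUR by 0 bytes: a no-op probe
}

if ( input_is_seekable(\*STDIN) ) {
    # fast path: seek to random offsets, as in the code below
    # sample_by_seek(\*STDIN, $count);      # hypothetical
} else {
    # fallback: one-pass reservoir sampling, or cache the whole input
    # sample_by_reservoir(\*STDIN, $count); # hypothetical
}
```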

#!perl
use strict;

my $file='file.txt';
my $count=shift || 10;
my $size=-s $file;

open(FILE,$file) || die "Can't open $file: $!\n";

while ($count--) {
   seek(FILE,int(rand($size)),0);
   readline(FILE);                            # discard the (likely partial) current line
   redo unless defined ($_ = readline(FILE)); # catch EOF
   print $_;
}

In barely more than pseudo-code:

use List::Util qw[shuffle];

my $n = shift || 10;

# read and shuffle the whole file
my @list = shuffle(<>);

# take the first 'n' from the list
splice(@list, $n) if @list > $n;

print @list;

This is the most trivial implementation, but you do have to read the whole file first, which will require that you have sufficient memory available.

Here's some verbose Perl code that should work with large files.

The heart of this code is that it does not store the whole file in memory, but only stores offsets in the file.

Use tell to get the offsets. Then seek to the appropriate places to recover the lines.

Better specification of target file and number of lines to get is left as an exercise for those less lazy than I. Those problems have been well solved.

#!/usr/bin/perl

use strict;
use warnings;

use List::Util qw(shuffle);

my $GET_LINES = 10; 

my @line_starts;
open( my $fh, '<', 'big_text_file' )
    or die "Oh, fudge: $!\n";

my $offset = 0;
while ( <$fh> ) {
    push @line_starts, $offset;   # offset where the line just read began
    $offset = tell $fh;
}

my $count = @line_starts;
print "Got $count lines\n";

my @shuffled_starts = (shuffle @line_starts)[0..$GET_LINES-1];

for my $start ( @shuffled_starts ) {

    seek $fh, $start, 0
        or die "Unable to seek to line - $!\n";

    print scalar <$fh>;
}