How can I get exactly n random lines from a file with Perl?

Submitted by 混江龙づ霸主 on 2019-12-04 06:32:04

Here's a nice one-pass algorithm that I just came up with, having O(N) time complexity and O(M) space complexity, for choosing M random lines from an N-line file.

Assume M <= N.

  1. Let S be the set of chosen lines. Initialize S to the first M lines of the file. If the ordering of the final result is important, shuffle S now.
  2. Read the next line, l. At this point we have read n total lines (n = M + 1 on the first iteration). The probability that we want to choose l as one of our final lines is therefore M/n.
  3. Accept l with probability M/n, using an RNG to decide whether to accept or reject it.
  4. If l has been accepted, randomly choose one of the lines in S and replace it with l.
  5. Repeat steps 2-4 until the file has been exhausted of lines, incrementing n with each new line read.
  6. Return the set S of chosen lines.
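The steps above can be sketched directly in Perl. This is only a sketch, not anyone's canonical implementation; the sub name sample_lines and the decision to shuffle on output (rather than in step 1) are my own choices.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# sample_lines($m, @lines): one-pass reservoir sample of $m lines
sub sample_lines {
    my ($m, @lines) = @_;
    my @s;        # the set S of chosen lines
    my $n = 0;    # total lines seen so far

    for my $line (@lines) {
        $n++;
        if (@s < $m) {
            push @s, $line;          # step 1: keep the first M lines
        }
        elsif (rand($n) < $m) {      # step 3: accept with probability M/n
            $s[rand @s] = $line;     # step 4: replace a random kept line
        }
    }
    return shuffle @s;               # shuffle if output ordering matters
}

print sample_lines(3, map "line $_\n", 1 .. 100);
```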

This takes a single command-line argument: the number of lines you want, M. The first M lines are kept, as you might not see any more. Thereafter, you randomly decide whether to take the next line, and if you do, you randomly decide which line in the current list-of-M to overwrite.

#!/usr/bin/perl
my $bufsize = shift;
my @list = ();

srand();
while (<>)
{
    # keep the first M lines unconditionally
    push(@list, $_), next if (@list < $bufsize);
    # keep line number $. with probability M/$.
    $list[ rand(@list) ] = $_ if (rand($. / $bufsize) < 1);
}
print foreach @list;

Possible solution:

  1. scan once to count the number of lines
  2. randomly decide which line numbers to pick
  3. scan again, picking out those lines
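A sketch of that two-pass idea (the one-pass code that follows takes a different, reservoir-style approach). The sub name pick_lines and the guard for $n exceeding the line count are my additions:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# pick_lines($file, $n): two passes over a seekable file
sub pick_lines {
    my ($file, $n) = @_;

    # pass 1: count the lines
    open my $fh, '<', $file or die "Can't open $file: $!\n";
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;

    # decide which line numbers to pick (without replacement)
    $n = $lines if $n > $lines;
    my %want = map { $_ => 1 } (shuffle 1 .. $lines)[0 .. $n - 1];

    # pass 2: scan again, collecting the chosen lines by number
    open $fh, '<', $file or die "Can't reopen $file: $!\n";
    my @picked;
    while (<$fh>) {
        push @picked, $_ if $want{$.};
    }
    close $fh;
    return @picked;
}

# usage, e.g.: print pick_lines('big_text_file', 10);
```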
my $n = shift || 10;   # number of lines to keep
my @result = ();

my $k = 0;
while (<>) {
    $k++;
    if (@result < $n) {
        push @result, $_;              # keep the first $n lines
    } elsif (rand() < $n / $k) {       # keep line $k with probability n/k
        $result[int rand $n] = $_;
    }
}

print for @result;

There's no need to know the actual line number in the file. Simply seek to a random place and keep the next line. (The current line will most likely be a partial one.) Note that this samples with replacement and favors lines that follow long lines, so it is only approximately uniform.

This approach should be very fast for large files, but it will not work for STDIN. Heck, nothing short of caching the entire file in memory will work for STDIN. So, if you must have STDIN, I don't see how you can be fast/cheap for large files.

You could detect STDIN and switch to a cached approach, otherwise be fast.
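One way to do that detection, as a sketch: a handle attached to a pipe or terminal is not seekable, so probe it with a zero-byte relative seek and choose the strategy accordingly. The sub name input_is_seekable and the two commented-out strategy calls are hypothetical, not from any answer here:

```perl
use strict;
use warnings;

# probe a handle: seek succeeds on plain files, fails on pipes/terminals
sub input_is_seekable {
    my ($fh) = @_;
    return seek($fh, 0, 1);   # SEEK_CUR by 0 bytes: a no-op probe
}

if ( input_is_seekable(\*STDIN) ) {
    # fast path: seek to random offsets, as in the code below
    # sample_by_seek(\*STDIN, $count);      # hypothetical
} else {
    # fallback: one-pass reservoir sampling, or cache the whole input
    # sample_by_reservoir(\*STDIN, $count); # hypothetical
}
```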

#!perl
use strict;

my $file='file.txt';
my $count=shift || 10;
my $size=-s $file;

open(FILE,$file) || die "Can't open $file: $!\n";

while ($count--) {
   seek(FILE,int(rand($size)),0);
   readline(FILE);                            # discard the (likely partial) current line
   redo unless defined ($_ = readline(FILE)); # catch EOF
   print $_;
}

In barely more than pseudo-code:

use List::Util qw[shuffle];

my $n = shift || 10;

# read and shuffle the whole file
my @list = shuffle(<>);

# take the first 'n' from the list
splice(@list, $n) if @list > $n;

print @list;

This is the most trivial implementation, but you do have to read the whole file first, which will require that you have sufficient memory available.

Here's some verbose Perl code that should work with large files.

The heart of this code is that it does not store the whole file in memory, but only stores offsets in the file.

Use tell to get the offsets. Then seek to the appropriate places to recover the lines.

Better specification of target file and number of lines to get is left as an exercise for those less lazy than I. Those problems have been well solved.

#!/usr/bin/perl

use strict;
use warnings;

use List::Util qw(shuffle);

my $GET_LINES = 10; 

my @line_starts;
open( my $fh, '<', 'big_text_file' )
    or die "Oh, fudge: $!\n";

my $offset = 0;
while ( <$fh> ) {
    push @line_starts, $offset;   # offset where the line just read began
    $offset = tell $fh;
}

my $count = @line_starts;
print "Got $count lines\n";

my @shuffled_starts = (shuffle @line_starts)[0..$GET_LINES-1];

for my $start ( @shuffled_starts ) {

    seek $fh, $start, 0
        or die "Unable to seek to line - $!\n";

    print scalar <$fh>;
}