Following up on this question, I need to get exactly n lines at random out of a file (or stdin). This would be similar to head or tail, except I want some from the middle.
Now, other than looping over the file with the solutions to the linked question, what's the best way to get exactly n lines in one run?
For reference, I tried this:
#!/usr/bin/perl -w
use strict;
my $ratio = shift;
print $ratio, "\n";
while (<>) {
    print if ((int rand $ratio) == 1);
}
where $ratio controls the fraction of lines I want (roughly 1 in $ratio). For instance, if I want 1 in 10 lines:
random_select 10 a.list
However, this doesn't give me an exact amount:
aaa> foreach i ( 0 1 2 3 4 5 6 7 8 9 )
foreach? random_select 10 a.list | wc -l
foreach? end
4739
4865
4739
4889
4934
4809
4712
4842
4814
4817
The other thought I had was slurping the input file and then choosing n lines at random from the array, but that's a problem if I have a really big file.
Any ideas?
Edit: This is an exact duplicate of this question.
Here's a nice one-pass algorithm that I just came up with, having O(N) time complexity and O(M) space complexity, for reading M lines from an N-line file.
Assume M <= N.
1. Let S be the set of chosen lines. Initialize S to the first M lines of the file. If the ordering of the final result is important, shuffle S now.
2. Read in the next line l. So far, we have read n = M + 1 total lines. The probability that we want to choose l as one of our final lines is therefore M/n.
3. Accept l with probability M/n; use a RNG to decide whether to accept or reject l.
4. If l has been accepted, randomly choose one of the lines in S and replace it with l.
5. Repeat steps 2-4 until the file has been exhausted of lines, incrementing n with each new line read.
6. Return the set S of chosen lines.
This takes a single command-line argument, which is the number of lines you want, N. The first N lines are held, as you might not see any more. Thereafter, you randomly decide whether to take the next line, and if you do, you randomly decide which line in the current list-of-N to overwrite.
#!/usr/bin/perl
use strict;
use warnings;

my $bufsize = shift;
my @list = ();

srand();
while (<>)
{
    # Hold the first $bufsize lines unconditionally.
    push(@list, $_), next if (@list < $bufsize);

    # Keep line number $. with probability $bufsize/$.,
    # overwriting a randomly chosen held line.
    $list[ rand(@list) ] = $_ if (rand($. / $bufsize) < 1);
}
print foreach @list;
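The replacement rule can be sanity-checked empirically. Here's a small harness (my own, not from the thread) that reservoir-samples 3 of the numbers 1..10 many times and reports how often each number survives; each should come out near 30%:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $keep   = 3;
my $lines  = 10;
my $trials = 20_000;
my %hits;

for ( 1 .. $trials ) {
    my @list;
    my $n = 0;
    for my $line ( 1 .. $lines ) {
        $n++;
        if ( @list < $keep ) {
            push @list, $line;               # fill the buffer first
        }
        elsif ( rand( $n / $keep ) < 1 ) {
            $list[ rand @list ] = $line;     # evict a random keeper
        }
    }
    $hits{$_}++ for @list;
}

printf "%2d: kept in %.1f%% of trials\n",
    $_, 100 * ( $hits{$_} || 0 ) / $trials for 1 .. $lines;
```

With 20,000 trials the observed frequencies should all land within a percent or so of 30%.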
Possible solution:
- scan one time to count the number of lines
- decide the line number to pick randomly
- scan again, pick the line
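Those steps might look like this in Perl (a sketch, assuming a seekable file argument rather than stdin; the pick_lines name is mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Two-pass pick: count the lines, draw $n distinct line numbers at
# random, then rescan and return those lines. Requires a seekable
# file, so it won't work on a pipe or stdin.
sub pick_lines {
    my ( $file, $n ) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!\n";

    # Pass 1: count the lines.
    my $count = 0;
    $count++ while <$fh>;
    die "Only $count lines available\n" if $n > $count;

    # Choose $n distinct line numbers, uniformly without replacement.
    my %want = map { $_ => 1 } ( shuffle 1 .. $count )[ 0 .. $n - 1 ];

    # Pass 2: rescan and collect the chosen lines, in file order.
    seek $fh, 0, 0 or die "Can't rewind: $!\n";
    my $line_no = 0;
    my @picked;
    while ( my $line = <$fh> ) {
        $line_no++;
        push @picked, $line if $want{$line_no};
    }
    close $fh;
    return @picked;
}

print pick_lines( $ARGV[1], $ARGV[0] ) if @ARGV == 2;
```

Note that seek does not reset $., which is why the second pass keeps its own line counter.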
my $n = shift;   # number of lines wanted
my @result = ();
my $k = 0;
while (<>) {
    $k++;
    if (@result < $n) {
        # Keep the first $n lines unconditionally.
        push @result, $_;
    }
    elsif (rand() < $n / $k) {
        # Keep line $k with probability $n/$k, evicting a random keeper.
        $result[int rand $n] = $_;
    }
}
print for @result;
There's no need to know the actual line number in the file. Simply seek to a random place and keep the next line. (The current line will most likely be a partial line.)
This approach should be very fast for large files, but it will not work for STDIN. Heck, nothing short of caching the entire file in memory will work for STDIN. So, if you must have STDIN, I don't see how you can be fast/cheap for large files.
You could detect STDIN and switch to a cached approach, otherwise be fast.
#!perl
use strict;
my $file = 'file.txt';
my $count = shift || 10;
my $size = -s $file;
open(FILE, $file) || die "Can't open $file\n";
while ($count--) {
    seek(FILE, int(rand($size)), 0);
    $_ = readline(FILE);                       # ignore partial line
    redo unless defined ($_ = readline(FILE)); # catch EOF
    print $_;
}
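The detect-and-switch idea could be sketched like this (the sample_lines name and structure are mine; note the seek branch inherits this approach's quirks: lines can repeat, longer lines are favored, and the first line is never picked):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pick $count lines from handle $fh: random seeks when the handle is
# a plain seekable file, one-pass reservoir sampling otherwise.
sub sample_lines {
    my ( $fh, $count ) = @_;
    my @keep;
    if ( -f $fh ) {
        # Seekable: jump to a random byte offset, discard the partial
        # line we landed in, and keep the next full line.
        my $size = -s $fh;
        while ( $count-- > 0 ) {
            seek $fh, int( rand $size ), 0;
            readline($fh);                # discard the partial line
            my $line = readline($fh);
            redo unless defined $line;    # landed in the last line; retry
            push @keep, $line;
        }
    }
    else {
        # Pipe or terminal: fall back to reservoir sampling.
        my $seen = 0;
        while ( defined( my $line = readline($fh) ) ) {
            $seen++;
            if    ( @keep < $count )       { push @keep, $line }
            elsif ( rand($seen) < $count ) { $keep[ rand @keep ] = $line }
        }
    }
    return @keep;
}

print sample_lines( \*STDIN, $ARGV[0] ) if @ARGV;
```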
In pseudo-code:
use List::Util qw[shuffle];
# read and shuffle the whole file
@list = shuffle(<>);
# take the first 'n' from the list
splice(@list, ...);
This is the most trivial implementation, but you do have to read the whole file first, which will require that you have sufficient memory available.
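A filled-in version of that pseudo-code might look like this (a sketch; it slurps the entire input, so it's only suitable when the file fits in memory):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

my $n = shift || 10;

# Read and shuffle the whole input.
my @list = shuffle(<>);

# Keep only the first $n shuffled lines.
splice( @list, $n ) if @list > $n;

print @list;
```

splice(@list, $n) truncates the array to its first $n elements, which, after the shuffle, is a uniform random sample.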
Here's some verbose Perl code that should work with large files.
The heart of this code is that it does not store the whole file in memory, but only stores offsets in the file.
Use tell to get the offsets. Then seek to the appropriate places to recover the lines.
Better specification of target file and number of lines to get is left as an exercise for those less lazy than I. Those problems have been well solved.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);
my $GET_LINES = 10;
my @line_starts;
open( my $fh, '<', 'big_text_file' )
or die "Oh, fudge: $!\n";
my $offset = 0;
while ( <$fh> ) {
    # Record the starting offset of each line; tell after the read
    # gives the start of the next one.
    push @line_starts, $offset;
    $offset = tell $fh;
}
my $count = @line_starts;
print "Got $count lines\n";
my @shuffled_starts = (shuffle @line_starts)[0..$GET_LINES-1];
for my $start ( @shuffled_starts ) {
seek $fh, $start, 0
or die "Unable to seek to line - $!\n";
print scalar <$fh>;
}
Source: https://stackoverflow.com/questions/856494/how-can-i-get-exactly-n-random-lines-from-a-file-with-perl