How can I get N random lines from very large files that can't fit in memory?
Also, it would be great if I could add filters before or after the randomization.
Update 1
In my case the specs are:
- > 100 million lines
- > 10 GB files
- usual random batch size 10,000-30,000
- 512 MB RAM, hosted Ubuntu 14.10 server
So losing a few lines from the file won't be such a big problem, as each has only a 1 in 10,000 chance of being picked anyway, but performance and resource consumption would be a problem.
Here's a wee bash function for you. It grabs, as you say, a "batch" of lines, with a random start point within a file.
randline() {
    local lines c r _
    # cache the number of lines in this file in a symlink in the temp dir
    lines="/tmp/${1//\//-}.lines"
    if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
        c=$(ls -l "$lines" | sed 's/.* //')
    else
        read c _ < <(wc -l "$1")
        ln -sfn "$c" "$lines"
    fi
    # Pick a random number...
    r=$[ $c * ($RANDOM * 32768 + $RANDOM) / (32768 * 32768) ]
    echo "start=$r" >&2
    # And display ${2} lines ending at that number.
    head -n $r "$1" | tail -n ${2:-1}
}
Edit the `echo` lines as required.
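As a quick illustration of the `head | tail` extraction the function is built on (the file name and numbers here are invented for the demo):

```shell
# Build a small sample file, then pull a 3-line "batch" ending at line 50.
seq 1 100 > /tmp/randline-demo.txt
head -n 50 /tmp/randline-demo.txt | tail -n 3
# prints:
# 48
# 49
# 50
```

`head -n r` keeps the first r lines, and `tail -n N` keeps the last N of those, so together they print lines r-N+1 through r.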
This solution has the advantage of fewer pipes, less resource-intensive pipes (i.e. no `| sort ... |`), and less platform dependence (i.e. no `sort -R`, which is GNU-sort-specific).
Note that this relies on Bash's `$RANDOM` variable, which may or may not actually be random. Also, it will miss lines if your source file contains more than 32768^2 lines, and there's a failure edge case if the number of lines you've specified (N) is >1 and the random start point is less than N lines from the beginning. Solving that is left as an exercise for the reader. :)
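To see where the 32768^2 limit comes from: `$RANDOM` yields 0..32767, so combining two draws covers 0..32768^2-1, and the scaled result always lands below the line count. A small sanity check (the count `c` here is an arbitrary example value):

```shell
c=1000000  # hypothetical line count
# Composed value is at most 32767*32768+32767 = 32768^2-1, so r < c always.
r=$(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) ))
[ "$r" -ge 0 ] && [ "$r" -lt "$c" ] && echo "in range"
# prints: in range
```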
UPDATE #1:
mklement0 asks an excellent question in comments about potential performance issues with the `head ... | tail ...` approach. I honestly don't know the answer, but I would hope that both `head` and `tail` are optimized sufficiently that they wouldn't buffer ALL input prior to displaying their output.
On the off chance that my hope is unfulfilled, here's an alternative. It's an awk-based "sliding window" tail. I'll embed it in the earlier function I wrote so you can test it if you want.
randline() {
    local lines c r _
    # Line count cache, per the first version of this function...
    lines="/tmp/${1//\//-}.lines"
    if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
        c=$(ls -l "$lines" | sed 's/.* //')
    else
        read c _ < <(wc -l "$1")
        ln -sfn "$c" "$lines"
    fi
    r=$[ $c * ($RANDOM * 32768 + $RANDOM) / (32768 * 32768) ]
    echo "start=$r" >&2
    # This replaces the functionality of the `head | tail` combo above
    # with a single invocation of awk.
    # It should handle any size of input file with the same load/impact.
    awk -v lines=${2:-1} -v count=0 -v start=$r '
        NR > start { exit }
        { out[NR] = $0; last = NR; count++ }
        count > lines { delete out[last - lines]; count-- }
        END {
            for (i = last - count + 1; i <= last; i++)
                print out[i]
        }
    ' "$1"
}
The embedded awk script replaces the `head ... | tail ...` pipeline in the previous version of the function. It works as follows:
- It reads lines up to the "start" line determined by the earlier randomization, then stops.
- It records each line it reads into an array.
- If the array holds more than the number of lines we want to keep, it drops the oldest entry.
- At the end, it prints the recorded window.
The result is that the awk process shouldn't grow its memory footprint because the output array gets trimmed as fast as it's built.
NOTE: I haven't actually tested this with your data.
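For what it's worth, here's a stand-alone, deterministic sketch of the same one-pass idea: emulating `head -n 50 | tail -n 3` with an awk sliding window (the input and numbers are invented for the demo):

```shell
# Keep a window of the last `lines` lines seen, stop at line `start`,
# then print the window -- equivalent to `head -n 50 | tail -n 3`.
seq 1 100 | awk -v start=50 -v lines=3 '
    NR > start { exit }
    { out[NR] = $0; last = NR; count++ }
    count > lines { delete out[last - lines]; count-- }
    END { for (i = last - count + 1; i <= last; i++) print out[i] }'
# prints:
# 48
# 49
# 50
```

Memory stays bounded by the window size, no matter how long the input is.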
UPDATE #2:
Hrm, with the update to your question that you want N random lines rather than a block of lines starting at a random point, we need a different strategy. The system limitations you've imposed are pretty severe. The following might be an option, also using awk, with random numbers still from Bash:
randlines() {
    local lines c r _
    # Line count cache...
    lines="/tmp/${1//\//-}.lines"
    if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
        c=$(ls -l "$lines" | sed 's/.* //')
    else
        read c _ < <(wc -l "$1")
        ln -sfn "$c" "$lines"
    fi
    # Create a LIST of random numbers, from 1 to the size of the file ($c)
    for (( i=0; i<$2; i++ )); do
        echo $[ $c * ($RANDOM * 32768 + $RANDOM) / (32768 * 32768) + 1 ]
    done | awk '
        # And here inside awk, build an array of those random numbers, and
        NR==FNR { lines[$1]; next; }
        # display lines from the input file that match the numbers.
        FNR in lines
    ' - "$1"
}
This works by feeding a list of random line numbers into awk as a "first" file, then having awk print lines from the "second" file whose line numbers were included in the "first" file. It uses `wc` to determine the upper limit for the random numbers to generate. That means you'll be reading the file twice. If you have another source for the number of lines in the file (a database, for example), do plug it in here. :)
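You can try the two-input awk trick deterministically by substituting fixed line numbers for the random ones (the file name and numbers here are invented):

```shell
# 10-line sample file; pick lines 3, 7 and 10 by number.
seq -f 'line %g' 10 > /tmp/sample.txt
printf '3\n7\n10\n' | awk 'NR==FNR { lines[$1]; next } FNR in lines' - /tmp/sample.txt
# prints:
# line 3
# line 7
# line 10
```

While awk reads the first input (`-`, i.e. stdin), `NR==FNR` holds and the numbers go into the array; for the second input only `FNR in lines` fires, printing the matching lines.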
A limiting factor might be the size of that first file, which must be loaded into memory. I believe that the 30000 random numbers should only take about 170KB of memory, but how the array gets represented in RAM depends on the implementation of awk you're using. (Though usually, awk implementations (including Gawk in Ubuntu) are pretty good at keeping memory wastage to a minimum.)
Does this work for you?
Under such tight resource limits, the following approach will work better.
- seek to random position in the file (e.g. you will be "inside" in some line)
- go backward from this position and find the start of the given line
- go forward and print the full line
For this you need a tool that can seek in files, for example `perl`.
use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY );

my $seekdiff = 256; # e.g. read from "rand_position-256" up to rand_position+256

my ($want, $filename) = @ARGV;

my $fd = gensym;
sysopen($fd, $filename, O_RDONLY) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek($fd, 0, SEEK_END) or die("Can't seek: $!");

my $buffer;
my $cnt = 0;
while ($want > $cnt++) {
    my $randpos = int(rand($endpos));                     # random file position
    my $seekpos = $randpos - $seekdiff;                   # start reading here ($seekdiff chars before)
    $seekpos = 0 if $seekpos < 0;
    sysseek($fd, $seekpos, SEEK_SET);                     # seek to position
    my $in_count = sysread($fd, $buffer, $seekdiff << 1); # read 2*seekdiff characters
    my $rand_in_buff = ($randpos - $seekpos) - 1;         # the random position in the buffer
    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; # find the beginning of the line in the buffer
    my $lineend = index $buffer, "\n", $linestart;        # find the end of the line in the buffer
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend - $linestart;
    print "$the_line\n";
}
Save the above into a file such as "randlines.pl" and use it as:
perl randlines.pl wanted_count_of_lines file_name
e.g.
perl randlines.pl 10000 ./BIGFILE
The script does very low-level IO operations, so it is VERY FAST. (On my notebook, selecting 30k lines from 10M took half a second.)
Simple (but slow) solution
n=15 #number of random lines
filter_before | sort -R | head -$n | filter_after
#or, if you could have duplicate lines
filter_before | nl | sort -R | cut -f2- | head -$n | filter_after
or, if you want, save the following into a `randlines` script
#!/bin/bash
nl | sort -R | cut -f2- | head -"${1:-10}"
and use it as:
filter_before | randlines 55 | filter_after #for 55 lines
How it works:
`sort -R` sorts the input by a random hash calculated for each line, so you get a randomised order of lines; therefore the first N lines are N random lines.
Because the hashing produces the same hash for identical lines, duplicate lines are not treated as distinct. You can eliminate the duplicates by adding line numbers (with `nl`), so the sort never sees an exact duplicate. After the `sort`, the added line numbers are removed.
example:
seq -f 'some line %g' 500 | nl | sort -R | cut -f2- | head -3
prints in subsequent runs:
some line 65
some line 420
some line 290
some line 470
some line 226
some line 132
some line 433
some line 424
some line 196
demo with duplicate lines:
yes 'one
two' | head -10 | nl | sort -R | cut -f2- | head -3
in subsequent runs print:
one
two
two
one
two
one
one
one
two
Finally, if you want, you can use `sed` instead of the `cut`:
sed -r 's/^\s*[0-9][0-9]*\t//'
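For example, stripping `nl`'s numbering with that `sed` expression (this assumes GNU sed's `-r`; BSD/macOS sed would need `-E` instead):

```shell
# nl prefixes each line with padded number + tab; sed strips that prefix.
seq -f 'item %g' 3 | nl | sed -r 's/^\s*[0-9][0-9]*\t//'
# prints:
# item 1
# item 2
# item 3
```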
#!/bin/bash
# contents of bashScript.sh
file="$1"
lineCnt=$2
filter="$3"
nfilter="$4"
echo "getting $lineCnt lines from $file matching '$filter' and not matching '$nfilter'" 1>&2
totalLineCnt=$(grep "$filter" "$file" | grep -v "$nfilter" | wc -l | grep -o '^[0-9]\+')
echo "filtered count : $totalLineCnt" 1>&2
chances=$(echo "$lineCnt/$totalLineCnt" | bc -l)
echo "chances : $chances" 1>&2
cat "$file" | awk -v chances="$chances" 'BEGIN { srand() } rand() <= chances { print }' | grep "$filter" | grep -v "$nfilter" | head -"$lineCnt"
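A quick sanity check of the `rand()`-based filter (note that the threshold is passed with `awk -v` rather than by interpolating a shell variable into the single-quoted awk program, which would never expand): with a probability of 1, every line passes.

```shell
# awk's rand() returns a value in [0, 1), so rand() <= 1 always holds.
seq 1 100 | awk -v chances=1 'BEGIN { srand() } rand() <= chances' | wc -l
# prints: 100
```

With a fractional `chances`, each line passes independently with that probability, so the expected output size is chances * input size.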
usage:
get a 1000-line random sample:
bashScript.sh /path/to/largefile.txt 1000
lines containing numbers:
bashScript.sh /path/to/largefile.txt 1000 "[0-9]"
excluding mike and jane:
bashScript.sh /path/to/largefile.txt 1000 "[0-9]" "mike|jane"
I've used `rl` for line randomisation and found it to perform quite well. Not sure how it scales to your case (you'd simply do e.g. `rl FILE | head -n NUM`). You can get it here: http://arthurdejong.org/rl/
Source: https://stackoverflow.com/questions/29102589/get-random-lines-from-large-files-in-bash