Question

How can I get n random lines from very large files that can't fit in memory?

Also it would be great if I could add filters before or after the randomization.

Update 1

In my case the specs are:

more than 100 million lines
files larger than 10 GB
usual random batch size of 10000-30000 lines
hosted Ubuntu Server 14.10 with 512 MB of RAM

So losing a few lines from the file won't be a big problem, as each line has roughly a 1 in 10000 chance of being picked anyway, but performance and resource consumption would be a problem.

Answer 1:

Here's a wee bash function for you. It grabs, as you say, a "batch" of lines, with a random start point within a file.

randline() {
  local lines c r _

  # cache the number of lines in this file in a symlink in the temp dir
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi

  # Pick a random number...
  r=$(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) ))
  echo "start=$r" >&2

  # And display the $2 lines that end at that line number.
  head -n "$r" "$1" | tail -n "${2:-1}"
}

Edit the echo lines as required.

This solution has the advantage of fewer pipes, less resource-intensive pipes (i.e. no | sort ... |), less platform dependence (i.e. no sort -R which is GNU-sort-specific).

Note that this relies on Bash's $RANDOM variable, which is pseudo-random rather than truly random. Also, it will miss lines if your source file contains more than 32768^2 lines, and there's a failure edge case if the number of lines you've specified (N) is >1 and the random start point is less than N lines from the beginning. Solving that is left as an exercise for the reader. :)
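One possible way to handle the edge case, as a rough sketch rather than a tested fix: draw a 30-bit number from two $RANDOM reads and map it into the range N..total, so that head -n "$r" | tail -n "$N" always has N lines to work with. The name rand_start is made up for this illustration, and the sketch still assumes the file has no more than 32768^2 lines.

```shell
#!/usr/bin/env bash
# rand_start TOTAL N: print a random line number r with N <= r <= TOTAL,
# so that `head -n r | tail -n N` always yields exactly N lines.
# Hypothetical helper; assumes 1 <= N <= TOTAL <= 32768*32768.
rand_start() {
  local total=$1 n=${2:-1} r30
  # Two 15-bit $RANDOM reads combined into one 30-bit value (0 .. 2^30-1).
  r30=$(( RANDOM * 32768 + RANDOM ))
  # Scale into 0 .. (total - n), then shift up by n.
  echo $(( (total - n + 1) * r30 / (32768 * 32768) + n ))
}

# Example: a block of 5 lines ending at a random line of a 100-line input.
r=$(rand_start 100 5)
seq 1 100 | head -n "$r" | tail -n 5
```

In randline() the total would come from the cached line count ($c) and N from ${2:-1}.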

UPDATE #1:

mklement0 asks an excellent question in comments about potential performance issues with the head ... | tail ... approach. I honestly don't know the answer, but I would hope that both head and tail are optimized sufficiently that they wouldn't buffer ALL input prior to displaying their output.

On the off chance that my hope is unfulfilled, here's an alternative. It's an awk-based "sliding window" tail. I'll embed it in the earlier function I wrote so you can test it if you want.

randline() {
  local lines c r _

  # Line count cache, per the first version of this function...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi

  r=$(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) ))

  echo "start=$r" >&2

  # This replaces the `head | tail` combo above with a single invocation
  # of awk that keeps a sliding window of the most recent lines.
  # Memory use depends on the window size, not on the size of the input file.
  awk -v lines="${2:-1}" -v last="$r" '
    NR > last { exit; }                     # past the random line: stop reading
    { out[NR]=$0; }                         # record the current line
    NR > lines { delete out[NR-lines]; }    # trim the window to "lines" entries
    END {
      for(i=last-lines+1;i<=last;i++) {
        if (i in out) print out[i];
      }
    }
  ' "$1"
}

The embedded awk script replaces the head ... | tail ... pipeline in the previous version of the function. It works as follows:

It records the current line to an array.
If the array holds more than the number of lines we want to keep, it eliminates the oldest record.
Once it has passed the line chosen by the earlier randomization, it stops reading.
On exit, it prints the recorded data.

The result is that the awk process shouldn't grow its memory footprint because the output array gets trimmed as fast as it's built.
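To try the sliding-window idea on a small input, here is a minimal sketch under my own assumptions: it keeps the last N lines seen and stops once it reaches a given line number. The name window_tail is hypothetical, and LAST is assumed to be at least 1.

```shell
# window_tail LAST N: print the N lines that end at line LAST of stdin,
# holding at most N lines in memory at any time.
# Hypothetical helper for illustration only; assumes LAST >= 1.
window_tail() {
  awk -v last="$1" -v n="${2:-1}" '
    { buf[NR] = $0 }                 # remember the current line
    NR > n { delete buf[NR - n] }    # drop the line that fell out of the window
    NR == last { exit }              # stop reading; the END block still runs
    END {
      for (i = NR - n + 1; i <= NR; i++)
        if (i in buf) print buf[i]
    }
  '
}

seq 1 1000 | window_tail 500 3    # prints 498, 499 and 500
```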

NOTE: I haven't actually tested this with your data.

UPDATE #2:

Hrm, with the update to your question that you want N random lines rather than a block of lines starting at a random point, we need a different strategy. The system limitations you've imposed are pretty severe. The following might be an option, also using awk, with random numbers still from Bash:

randlines() {
  local lines c r _

  # Line count cache...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi

  # Create a LIST of random numbers, from 1 to the size of the file ($c)
  for (( i=0; i<$2; i++ )); do
    echo $(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) + 1 ))
  done | awk '
    # And here inside awk, build an array of those random numbers, and
    NR==FNR { lines[$1]; next; }
    # display lines from the input file that match the numbers.
    FNR in lines
  ' - "$1"
}

This works by feeding a list of random line numbers into awk as a "first" file, then having awk print lines from the "second" file whose line numbers were included in the "first" file. It uses wc to determine the upper limit of the random numbers to generate. That means you'll be reading this file twice. If you have another source for the number of lines in the file (a database for example), do plug it in here. :)
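The two-file idiom can be tried on a tiny input first. This is only an illustrative sketch with hard-coded line numbers instead of random ones; pick_lines is a made-up name.

```shell
# pick_lines FILE: read wanted line numbers from stdin (one per line)
# and print the matching lines of FILE, in file order.
# Hypothetical helper; a line number listed twice is still printed once.
pick_lines() {
  awk '
    NR == FNR { want[$1]; next }   # first input (stdin): collect line numbers
    FNR in want                    # second input: print lines whose number was collected
  ' - "$1"
}

tmp=$(mktemp)
seq 101 110 > "$tmp"
printf '%s\n' 7 2 2 9 | pick_lines "$tmp"   # prints 102, 107 and 109
rm -f "$tmp"
```

Because repeated numbers collapse into one array key, a list of N random numbers can yield slightly fewer than N lines.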

A limiting factor might be the size of that first file, which must be loaded into memory. I believe that the 30000 random numbers should only take about 170KB of memory, but how the array gets represented in RAM depends on the implementation of awk you're using. (Though usually, awk implementations (including Gawk in Ubuntu) are pretty good at keeping memory wastage to a minimum.)
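If reading the file twice is a concern, a different technique, reservoir sampling (Algorithm R), needs only one pass and no line count. This is a swapped-in alternative, not the approach described above, and reservoir is a hypothetical name. It holds N whole lines in awk's memory, so the footprint depends on line length, and uniformity is limited by the resolution of awk's rand().

```shell
# reservoir N [FILE]: print N lines chosen at random in a single pass.
# Sketch of reservoir sampling (Algorithm R); output order is not shuffled.
# srand() seeds from the clock, so two runs within the same second
# may return the same sample.
reservoir() {
  local n=$1; shift
  awk -v n="$n" '
    BEGIN { srand() }
    NR <= n { keep[NR] = $0; next }    # fill the reservoir with the first n lines
    {
      j = int(rand() * NR) + 1         # random integer in 1..NR
      if (j <= n) keep[j] = $0         # replace a slot with probability n/NR
    }
    END {
      m = (NR < n) ? NR : n
      for (i = 1; i <= m; i++) print keep[i]
    }
  ' "$@"
}

seq 1 100000 | reservoir 5
```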

Does this work for you?

Answer 2:

Under constraints like these, the following approach works better:

Seek to a random position in the file (you will usually land somewhere inside a line).
Go backward from this position and find the start of that line.
Go forward and print the full line.

For this you need a tool that can seek in files, for example Perl.

use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY ) ;
my $seekdiff = 256; #e.g. from "rand_position-256" up to "rand_position+256"
                    #lines that do not fit inside this window come out truncated or empty

my($want, $filename) = @ARGV;

my $fd = gensym ;
sysopen($fd, $filename, O_RDONLY ) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek( $fd, 0, SEEK_END ) or die("Can't seek: $!");

my $buffer;
my $cnt;
while($want > $cnt++) {
    my $randpos = int(rand($endpos));   #random file position
    my $seekpos = $randpos - $seekdiff; #start read here ($seekdiff chars before)
    $seekpos = 0 if( $seekpos < 0 );

    sysseek($fd, $seekpos, SEEK_SET);   #seek to position
    my $in_count = sysread($fd, $buffer, $seekdiff<<1); #read 2*seekdiff characters

    my $rand_in_buff = ($randpos - $seekpos)-1; #the random position in the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; #find the beginning of the line in the buffer
    my $lineend = index $buffer, "\n", $linestart;            #find the end of line in the buffer
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend-$linestart;

    print "$the_line\n";
}

Save the above into a file such as "randlines.pl" and use it as:

perl randlines.pl wanted_count_of_lines file_name

e.g.

perl randlines.pl 10000 ./BIGFILE

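The script uses very low-level I/O operations, so it is VERY FAST (on my notebook, selecting 30k lines from 10M took half a second). Note that, because positions are picked by byte offset, longer lines are more likely to be picked, and the same line can be picked more than once.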
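The same seek-then-read idea can be roughed out in bash, mostly to illustrate the steps. This sketch is a variation, not a port of the Perl script: instead of searching backward, it discards the partial line it lands in and prints the next complete one. The name rand_line_by_seek is hypothetical, and it assumes bash, a regular non-empty file, and a tail that accepts -c +N. Forking tail and sed for every line will likely be much slower than the Perl version when thousands of lines are wanted.

```shell
#!/usr/bin/env bash
# rand_line_by_seek FILE: jump to a random byte offset, skip the rest of the
# line we landed in, and print the next complete line.
# Hypothetical helper. The first line of the file is never picked, and
# nothing is printed when the offset falls inside the last line (retry then).
rand_line_by_seek() {
  local file=$1 size pos
  size=$(wc -c < "$file")
  # Three 15-bit $RANDOM reads give 45 random bits (offsets up to 32 TiB).
  pos=$(( ((RANDOM << 30) | (RANDOM << 15) | RANDOM) % size + 1 ))
  # tail -c +pos starts output at byte pos; line 1 of that output is usually
  # a partial line, so print line 2 and quit.
  tail -c "+$pos" "$file" | sed -n '2{p;q;}'
}

tmp=$(mktemp)
seq -f 'some line %g' 500 > "$tmp"
rand_line_by_seek "$tmp"
rm -f "$tmp"
```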

Answer 3:

Simple (but slow) solution

n=15 #number of random lines
filter_before | sort -R | head -$n | filter_after

#or, if you could have duplicate lines
filter_before | nl | sort -R | cut -f2- | head -$n | filter_after
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

or if you want, save the following into a randlines script

#!/bin/bash
nl | sort -R | cut -f2- | head -n "${1:-10}"

and use it as:

filter_before | randlines 55 | filter_after   #for 55 lines

How it works:

sort -R sorts the input by a random hash calculated for each line, so you get a randomised order of lines, and the first N lines are therefore random lines.

Because the hashing produces the same hash for the same line, duplicate lines are not treated as different lines. It is possible to work around this by adding the line number (with nl), so that sort never sees an exact duplicate. After the sort, the added line numbers are removed again.

example:

seq -f 'some line %g' 500 | nl | sort -R | cut -f2- | head -3

prints in subsequent runs:

some line 65
some line 420
some line 290

some line 470
some line 226
some line 132

some line 433
some line 424
some line 196

demo with duplicate lines:

yes 'one
two' | head -10 | nl | sort -R | cut -f2- | head -3

prints in subsequent runs:

one
two
two

one
two
one

one
one
two

Finally, if you prefer, you could use sed instead of cut:

sed -r 's/^\s*[0-9][0-9]*\t//'

Answer 4:

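#!/bin/bash
#contents of bashScript.sh

file="$1";
lineCnt="$2";
filter="$3";    # extended regex that lines must match (empty matches every line)
nfilter="$4";   # extended regex that lines must not match (empty disables this check)
echo "getting $lineCnt lines from $file matching '$filter' and not matching '$nfilter'" 1>&2;

# Apply both filters to stdin. grep -v with an empty pattern would drop
# every line, so the negative filter is skipped when it is empty.
applyFilters() {
  if [ -n "$nfilter" ]; then
    grep -E -- "$filter" | grep -Ev -- "$nfilter"
  else
    grep -E -- "$filter"
  fi
}

totalLineCnt=$(applyFilters < "$file" | wc -l);
echo "filtered count : $totalLineCnt" 1>&2;

chances=$( echo "$lineCnt/$totalLineCnt" | bc -l );
echo "chances : $chances" 1>&2;

# Pass the probability to awk with -v; inside single quotes the shell
# does not expand $chances.
awk -v chances="$chances" 'BEGIN { srand() } rand() <= chances' "$file" | applyFilters | head -n "$lineCnt";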

Usage:

Get a random sample of 1000 lines:

bashScript.sh /path/to/largefile.txt 1000

Only lines that contain digits:

bashScript.sh /path/to/largefile.txt 1000 "[0-9]"

Lines that contain digits but neither mike nor jane:

bashScript.sh /path/to/largefile.txt 1000 "[0-9]" "mike|jane"
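The heart of this approach is a per-line coin flip. Here is a minimal sketch of just that step, with the probability handed to awk through -v; sample_prob is a hypothetical name, and the number of lines returned varies from run to run around P times the input size.

```shell
# sample_prob P [FILE]: keep each input line independently with probability P.
# Hypothetical helper; srand() seeds from the clock, so two runs within the
# same second may return the same lines.
sample_prob() {
  local p=$1; shift
  awk -v p="$p" 'BEGIN { srand() } rand() < p' "$@"
}

# Roughly 1000 of 100000 lines; head trims any surplus.
seq 1 100000 | sample_prob 0.01 | head -n 1000
```

Since the count is only approximate, a run can come back with somewhat fewer lines than requested; raising P a little and letting head trim the surplus is one way around that, at the cost of slightly favouring lines near the start of the file.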

Answer 5:

I've used rl for line randomisation and found it to perform quite well. I'm not sure how it scales to your case (you'd simply do e.g. rl FILE | head -n NUM). You can get it here: http://arthurdejong.org/rl/

Source: https://stackoverflow.com/questions/29102589/get-random-lines-from-large-files-in-bash

Tags

bash