I have a file with 10^7 lines, and I want to choose 1/100 of its lines at random. This is the AWK code I have, but it slurps all the file content beforehand. My PC
In this case, reservoir sampling to get exactly k values is easy enough with awk that I'm surprised no answer has suggested it yet. I had to solve the same problem, and I wrote the following awk
program for sampling:
BEGIN {
    srand();
}
# Fill the reservoir with the first k lines (slots 0 .. k-1).
NR <= k {
    reservoir[NR - 1] = $0;
    next;
}
# Each later line replaces a random slot with probability k/NR.
{
    i = int(NR * rand());
    if (i < k) {
        reservoir[i] = $0;
    }
}
END {
    for (i = 0; i < k; i++) {
        print reservoir[i];
    }
}
Then figuring out what k is has to be done separately. Note that -v assigns a plain string, not an awk expression, so wrapping the value in int(...) will not be evaluated; compute the number in the shell instead (assuming the program above is saved as sample.awk):
awk -v "k=$(( $(wc -l < FILE) / 100 ))" -f sample.awk FILE
The problem of how to uniformly sample N elements out of a large population (of unknown size) is solved by a technique known as Reservoir Sampling. (If you like algorithm problems, do spend a few minutes trying to solve it without reading the algorithm on Wikipedia.)
A web search for "Reservoir Sampling" will find a lot of implementations. Here is Perl and Python code that implements what you want, and here is another Stack Overflow thread discussing it.
Instead of waiting until the end to randomly pick your 1% of lines, do the picking once every 100 lines. That way, you only hold 100 lines in memory at a time.
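A minimal sketch of that idea (keeping exactly one random line out of every block of 100; any partial block at the end of the file is dropped):
awk 'BEGIN { srand() }
{ block[NR % 100] = $0 }                          # buffer the current block of 100 lines
NR % 100 == 0 { print block[int(100 * rand())] }  # block full: emit one line at random
' FILE
Note that this is not quite the same distribution as a uniform 1% sample, since it can never pick two lines from the same block of 100, but for many purposes it is close enough.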
If you have that many lines, are you sure you want exactly 1%, or would a statistical estimate be enough?
In that case, just keep each line with probability 1%:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'
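One caveat: srand() with no argument seeds from the time of day, so two runs started within the same second will produce the identical sample. If that matters, feed a seed in from the shell, e.g. awk -v seed="$RANDOM" 'BEGIN { srand(seed) } ...' in shells that provide $RANDOM.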
If you'd like the header line plus a random sample of lines after, use:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'
This should work on most any GNU/Linux machine:
$ shuf -n $(( $(wc -l < "$file") / 100 )) "$file"
I'd be surprised if memory management was done inappropriately by the GNU shuf command.
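One thing to be aware of: shuf emits the sampled lines in shuffled order. If you want to keep the sample in the file's original order, one option (a sketch using only standard GNU coreutils) is to number the lines before sampling, then sort and strip the numbers afterwards:
$ nl -ba "$file" | shuf -n $(( $(wc -l < "$file") / 100 )) | sort -n | cut -f2-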
I don't know awk, but there is a great technique for solving a more general version of the problem you've described, and in the general case it is quite a lot faster than the "for each line, keep it if rand() < 0.01" approach, so it can be useful if you intend to do tasks like this many (thousands, millions of) times. It is known as reservoir sampling, and this page has a pretty good explanation of a version of it that is applicable to your situation.