I have a 10^7 lines file, in which I want to choose 1/100 of lines randomly from the file. This is the AWK code I have, but it slurps all the file content before hand. My PC
You used awk, but I don't know if it's required. If it's not, here's a trivial way to do w/ perl (and without loading the entire file into memory):
cat your_file.txt | perl -n -e 'print if (rand() < .01)'
(simpler form, from comments):
perl -ne 'print if (rand() < .01)' your_file.txt
I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.
This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)
Original thread: Submitted for your review -- random sampling in awk.
# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2
BEGIN {
if (!n) {
print "Usage: sample.awk -v n=[size]"
exit
}
t = n
srand()
}
NR <= n {
pool[NR] = $0
places[NR] = NR
next
}
NR > n {
t++
M = int(rand()*t) + 1
if (M <= n) {
READ_NEXT_RECORD(M)
}
}
END {
if (NR < n) {
print "sample.awk: Not enough records for sample" \
> "/dev/stderr"
exit
}
# gawk needs a numeric sort function
# since it doesn't have one, zero-pad and sort alphabetically
pad = length(NR)
for (i in pool) {
new_index = sprintf("%0" pad "d", i)
newpool[new_index] = pool[i]
}
x = asorti(newpool, ordered)
for (i = 1; i <= x; i++)
print newpool[ordered[i]]
}
function READ_NEXT_RECORD(idx) {
rec = places[idx]
delete pool[rec]
pool[NR] = $0
places[idx] = NR
}
If the aim is just to avoid memory exhaustion, and the file is a regular file, no need to implement reservoir sampling. The number of lines in the file can be known if you do two passes in the file, one to get the number of lines (like with wc -l
), one to select the sample:
file=/some/file
awk -v percent=0.01 -v n="$(wc -l < "$file")" '
BEGIN {srand(); p = int(n * percent)}
rand() * n-- < p {p--; print}' < "$file"
You could do it in two passes:
Example in python:
fn = '/usr/share/dict/words'
from random import randint
from sys import stdout
count = 0
with open(fn) as f:
for line in f:
count += 1
selected = set()
while len(selected) < count//100:
selected.add(randint(0, count-1))
index = 0
with open(fn) as f:
for line in f:
if index in selected:
stdout.write(line)
index += 1