I have a lot of text files that look like this:
>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCC
Sampling a text file at random without replacement means that once a line has been selected (sampled) it cannot be selected again. Thus, if 10 of 100 lines are to be selected, the ten random line numbers must be unique.
Here is a script to produce NUM random (without replacement) samples from a text FILE:
#!/usr/bin/env bash
# random-samples.sh NUM FILE
# extract NUM random (without replacement) lines from FILE
num=$(( 10#${1:?'Missing sample size'} ))
file="${2:?'Missing file to sample'}"
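lines=$(wc -l < "$file") # number of lines in the file
if (( num > lines )); then # cannot draw more unique lines than exist
echo "Sample size $num exceeds the $lines lines in $file" >&2
exit 1
fi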
# get_sample MAX
#
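# get a random number between 1 .. max
# (see the bash man page on RANDOM)
get_sample() {
local max="$1"
# RANDOM is only 15 bits (0..32767); combine two draws so files longer
# than 32768 lines are fully covered, then reduce to 1..max
local rand=$(( ((RANDOM << 15) | RANDOM) % max + 1 ))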
echo "$rand"
}
# select_line LINE FILE
#
# select line LINE from FILE
select_line() {
head -n "$1" "$2" | tail -n 1
}
declare -A samples # keep track of samples
for ((i=1; i<=num; i++)) ; do
sample=
while [[ -z "$sample" ]]; do
sample=`get_sample $lines` # get a new sample
if [[ -n "${samples[$sample]}" ]]; then # already used?
sample= # yes, go again
else
(( samples[$sample]=1 )) # new sample, track it
fi
done
line=$(select_line "$sample" "$file") # fetch the sampled line
printf "%2d: %s\n" $i "$line"
done
exit
Here is the output of a few invocations:
./random-samples.sh 10 poetry-samples.txt
1: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
2: 25. Hope springs eternal in the human breast 1,080,000 Alexander Pope
3: 43. The moving finger writes; and, having writ,/Moves on571,000 Edward Fitzgerald
4: 5. And miles to go before I sleep 5,350,000 Robert Frost
5: 6. Not with a bang but a whimper 5,280,000 T.S. Eliot
6: 40. In Xanadu did Kubla Khan 594,000 Coleridge
7: 41. The quality of mercy is not strained 589,000 Shakespeare
8: 7. Tread softly because you tread on my dreams 4,860,000 W.B. Yeats
9: 42. They also serve who only stand and wait 584,000 Milton
10: 48. If you can keep your head when all about you 447,000Kipling
./random-samples.sh 10 poetry-samples.txt
1: 38. Shall I compare thee to a summers day 638,000 Shakespeare
2: 34. Busy old fool, unruly sun 675,000 John Donne
3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
4: 45. We few, we happy few, we band of brothers 521,000Shakespeare
5: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
6: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
7: 46. If music be the food of love, play on 507,000 Shakespeare
8: 44. What is this life if, full of care,/We have no time to stand and stare 528,000 W.H. Davies
9: 35. Do not go gentle into that good night 665,000 Dylan Thomas
10: 15. But at my back I always hear 2,010,000 Marvell
./random-samples.sh 10 poetry-samples.txt
1: 26. I think that I shall never see/A poem lovely as a tree. 1,080,000 Joyce Kilmer
2: 32. Human kind/Cannot bear very much reality 891,000 T.S. Eliot
3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
4: 13. My mistress’ eyes are nothing like the sun 2,230,000Shakespeare
5: 42. They also serve who only stand and wait 584,000 Milton
6: 24. When in disgrace with fortune and men's eyes 1,100,000Shakespeare
7: 21. A narrow fellow in the grass 1,310,000 Emily Dickinson
8: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
9: 10. Tis better to have loved and lost/Than never to have loved at all 2,400,000 Tennyson
10: 31. O Romeo, Romeo; wherefore art thou Romeo 912,000Shakespeare
How about this for a random sampling of 10% of your lines?
awk 'rand()>0.9' yourfile1 yourfile2 anotherfile
I am not sure what you mean by "replacement"... there is no replacement occurring here, just random selection.
Basically, it looks at each line of each file exactly once and generates a random number in the interval 0 to 1. If the random number is greater than 0.9, the line is output. In effect it rolls a 10-sided die for each line and prints the line only if the die comes up 10, so you get roughly, not exactly, 10% of the lines. There is no chance of a line being printed twice, unless it occurs twice in your files, of course.
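For added randomness (!) you can add an srand() at the start, as suggested by @klashxx. Without it, many awk implementations start from the same seed and so select the same lines on every run: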
awk 'BEGIN{srand()} rand()>0.9' yourfile(s)
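To see what the per-line test does in practice, here is a small sketch. The function name `sample_fraction` and the optional seed argument are assumptions made for illustration; they are not part of the one-liner above. It uses `rand() < p`, which is equivalent to `rand() > 0.9` when p is 0.1.

```shell
#!/usr/bin/env bash
# sample_fraction P [SEED]: keep each line of stdin with probability P.
# (hypothetical helper; pass SEED only when you want repeatable runs)
sample_fraction() {
    awk -v p="$1" -v seed="$2" '
        BEGIN { if (seed != "") srand(seed); else srand() }
        rand() < p    # true for roughly a fraction P of the lines
    '
}

# 1000 input lines: expect about 100 on output, but not exactly 100
seq 1 1000 | sample_fraction 0.1 42 | wc -l
```

Because every line is decided independently, the count printed will hover around 10% of the input rather than hitting it exactly. If you need an exact number of lines, this is the wrong tool.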
Yes, but I wouldn't. I would use shuf or sort -R (neither is POSIX) to randomize the file and then select the first n lines using head.
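A minimal sketch of that approach, assuming GNU coreutils is installed. The function names are hypothetical. Note that `shuf -n` picks the lines directly, so the separate `head` is not needed with it.

```shell
#!/usr/bin/env bash
# sample_shuf N: print N distinct lines of stdin in random order (GNU shuf)
sample_shuf() {
    shuf -n "$1"
}

# sample_sort N: the same idea with GNU sort -R. It sorts by a random hash of
# each line, so identical input lines end up adjacent; treat it as a true
# shuffle only when all lines are unique.
sample_sort() {
    sort -R | head -n "$1"
}

# ten distinct numbers between 1 and 100, different on each run
seq 1 100 | sample_shuf 10
```

Both variants sample without replacement, since each input line appears only once in the shuffled output.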
If you really want to use awk for this, you would need to use the rand function, as Mark Setchell points out.
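If you do go the awk route and need an exact number of lines, one option is selection sampling (Knuth's Algorithm S). This is a sketch under assumptions, not something proposed in the answers above: the helper name `sample_exact` is made up, and it needs the line count up front, obtained here with `wc -l`.

```shell
#!/usr/bin/env bash
# sample_exact N FILE: print exactly N distinct lines of FILE, in file order.
# (hypothetical helper; if FILE has fewer than N lines, all of them are printed)
sample_exact() {
    local n="$1" file="$2" total
    total=$(wc -l < "$file")
    awk -v n="$n" -v total="$total" '
        BEGIN { srand() }
        # keep this line with probability (lines still needed) / (lines still left)
        rand() * (total - NR + 1) < n - picked { print; picked++ }
    ' "$file"
}

# demo on a throwaway file of 100 numbered lines
tmp=$(mktemp)
seq 1 100 > "$tmp"
sample_exact 10 "$tmp"
rm -f "$tmp"
```

Each line is visited once, so no line can be chosen twice. The output keeps the original file order; pipe it through a shuffle if the order should be random as well.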
Maybe it's better to sample the file on a fixed schedule, such as taking every 10th line. You can do that with this awk one-liner:
awk '0==NR%10' filename
If you want to sample a percentage of the total, you can compute the step the awk one-liner should use so that the number of records printed matches that quantity or percentage.
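One way to work out that step, as a rough sketch: divide 100 by the percentage and print every line whose number is a multiple of the result. The helper name `sample_every` is hypothetical, and it assumes a whole-number percentage between 1 and 100.

```shell
#!/usr/bin/env bash
# sample_every PCT: print every k-th line of stdin, where k = 100 / PCT
# (hypothetical helper; PCT must be an integer from 1 to 100)
sample_every() {
    local pct="$1"
    local step=$(( 100 / pct ))   # integer division: 10% -> 10, 25% -> 4
    awk -v step="$step" 'NR % step == 0'
}

seq 1 100 | sample_every 10   # prints 10, 20, ..., 100
```

Because of the integer division, the result matches the requested percentage exactly only when the percentage divides 100 evenly; 30%, for instance, gives a step of 3 and so about 33% of the lines. Keep in mind that this selection is systematic rather than random.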
I hope this helps!