Sampling without replacement using awk

Asked by 孤街浪徒 on 2020-12-11 04:35 · 4 answers · 608 views

I have a lot of text files that look like this:

>ALGKAHOLAGGATACCATAGATGGCACGCCCT
>BLGKAHOLAGGATACCATAGATGGCACGCCCT
>HLGKAHOLAGGATACCATAGATGGCACGCCC

How can I randomly sample some of these lines, without replacement, using awk?
4 Answers
  • 2020-12-11 04:57

    Sampling from a text file without replacement means that once a line has been randomly selected it cannot be selected again. Thus, if 10 of 100 lines are to be sampled, the ten random line numbers must be unique.

    Here is a script to produce NUM random (without replacement) samples from a text FILE:

    #!/usr/bin/env bash
    # random-samples.sh NUM FILE
    # extract NUM random (without replacement) lines from FILE
    
    num=$(( 10#${1:?'Missing sample size'} ))
    file="${2:?'Missing file to sample'}"
    
    lines=$(wc -l < "$file")   # number of lines in the file
    
    if (( num > lines )); then
      echo "Cannot sample $num lines from a $lines-line file" >&2
      exit 1
    fi
    
    # get_sample MAX
    #
    # print a random line number between 1 and MAX
    # (see the bash man page on RANDOM)
    
    get_sample() {
      local max="$1"
      # RANDOM yields 0..32767, so this only works for files
      # of at most 32767 lines; the modulo keeps the result in 1..max
      echo $(( (RANDOM % max) + 1 ))
    }
    
    # select_line LINE FILE
    #
    # print line number LINE of FILE
    
    select_line() {
      head -n "$1" "$2" | tail -n 1
    }
    
    declare -A samples     # line numbers already used
    
    for ((i=1; i<=num; i++)); do
      sample=
      while [[ -z "$sample" ]]; do
        sample=$(get_sample "$lines")              # draw a candidate
        if [[ -n "${samples[$sample]}" ]]; then    # already used?
          sample=                                  # yes, draw again
        else
          samples[$sample]=1                       # new sample, track it
        fi
      done
      line=$(select_line "$sample" "$file")        # fetch the sampled line
      printf "%2d: %s\n" "$i" "$line"
    done
    

    Here is the output of a few invocations:

    ./random-samples.sh 10 poetry-samples.txt
     1: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
     2: 25. Hope springs eternal in the human breast 1,080,000 Alexander Pope
     3: 43. The moving finger writes; and, having writ,/Moves on 571,000 Edward Fitzgerald
     4: 5. And miles to go before I sleep 5,350,000 Robert Frost
     5: 6. Not with a bang but a whimper 5,280,000 T.S. Eliot
     6: 40. In Xanadu did Kubla Khan 594,000 Coleridge
     7: 41. The quality of mercy is not strained 589,000 Shakespeare
     8: 7. Tread softly because you tread on my dreams 4,860,000 W.B. Yeats
     9: 42. They also serve who only stand and wait 584,000 Milton
    10: 48. If you can keep your head when all about you 447,000 Kipling
    
    ./random-samples.sh 10 poetry-samples.txt
     1: 38. Shall I compare thee to a summers day 638,000 Shakespeare
     2: 34. Busy old fool, unruly sun 675,000 John Donne
     3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
     4: 45. We few, we happy few, we band of brothers 521,000 Shakespeare
     5: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
     6: 11. Because I could not stop for death/He kindly stopped for me 2,360,000 Emily Dickinson
     7: 46. If music be the food of love, play on 507,000 Shakespeare
     8: 44. What is this life if, full of care,/We have no time to stand and stare 528,000 W.H. Davies
     9: 35. Do not go gentle into that good night 665,000 Dylan Thomas
    10: 15. But at my back I always hear 2,010,000 Marvell
    
    ./random-samples.sh 10 poetry-samples.txt
     1: 26. I think that I shall never see/A poem lovely as a tree. 1,080,000 Joyce Kilmer
     2: 32. Human kind/Cannot bear very much reality 891,000 T.S. Eliot
     3: 14. Candy/Is dandy/But liquor/Is quicker 2,150,000 Ogden Nash
     4: 13. My mistress’ eyes are nothing like the sun 2,230,000 Shakespeare
     5: 42. They also serve who only stand and wait 584,000 Milton
     6: 24. When in disgrace with fortune and men's eyes 1,100,000 Shakespeare
     7: 21. A narrow fellow in the grass 1,310,000 Emily Dickinson
     8: 9. Look on my works, ye mighty, and despair 3,080,000 Shelley
     9: 10. Tis better to have loved and lost/Than never to have loved at all 2,400,000 Tennyson
    10: 31. O Romeo, Romeo; wherefore art thou Romeo 912,000 Shakespeare
    
  • 2020-12-11 05:05

    How about this for a random sampling of roughly 10% of your lines?

    awk 'rand()>0.9' yourfile1 yourfile2 anotherfile
    

    I am not sure what you mean by "replacement"... there is no replacement occurring here, just random selection.

    Basically, it looks at each line of each file precisely once and generates a random number on the interval [0,1). If that number is greater than 0.9, the line is output. So it is essentially rolling a 10-sided die for each line and printing the line only if the die comes up 10. There is no chance of a line being printed twice - unless, of course, it occurs twice in your files.

    For added randomness(!), you can call srand() at the start to seed the generator, as suggested by @klashxx; without it, many awk implementations return the same sequence of rand() values on every run:

    awk 'BEGIN{srand()} rand()>0.9' yourfile(s)
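
    If you need exactly N lines rather than a ~10% expectation, but still want to stay in awk, reservoir sampling is a standard alternative; this sketch is my addition, not part of the answer above. It keeps exactly n random lines, without replacement, in a single pass:

    ```shell
    # Reservoir sampling: each input line survives with equal probability,
    # and no line can appear twice. n=3 is just an example value.
    awk -v n=3 '
      BEGIN { srand() }
      NR <= n { r[NR] = $0; next }        # fill the reservoir first
      { i = int(rand() * NR) + 1          # pick a slot in 1..NR
        if (i <= n) r[i] = $0 }           # keep this line with prob n/NR
      END { for (j = 1; j <= n; j++) print r[j] }
    ' yourfile
    ```

    Unlike the percentage filter above, this always prints exactly n lines (assuming the file has at least n).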
    
  • 2020-12-11 05:14

    Yes, it can be done with awk, but I wouldn't. I would use shuf or sort -R (neither is POSIX) to shuffle the file and then select the first n lines with head.

    If you really want to use awk for this, you need to use the rand() function, as Mark Setchell points out.
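
    A sketch of the shuffle-then-head approach (shuf is GNU coreutils; note that sort -R sorts by a hash of each line, so identical lines come out adjacent and it only behaves as a fair shuffle when the lines are distinct):

    ```shell
    # Draw 10 random lines without replacement (GNU coreutils).
    shuf -n 10 yourfile

    # Equivalent two-step form: shuffle everything, then take the head.
    sort -R yourfile | head -n 10
    ```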

  • 2020-12-11 05:20

    It may be better to sample the file using a fixed scheme, such as taking one record every 10 lines. You can do that with this awk one-liner:

    awk '0==NR%10' filename
    

    If you want to sample a given percentage of the total, you can compute the modulus the one-liner should use so that the number of records printed matches that count or percentage.
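
    For example, to print roughly n evenly spaced records, you can derive the step from the line count; the arithmetic here is my sketch, not part of the answer:

    ```shell
    # Systematic sampling: keep about n evenly spaced lines from a file.
    file=yourfile
    n=10
    total=$(wc -l < "$file")
    step=$(( total / n ))        # e.g. 100 lines, n=10 -> every 10th line
    (( step >= 1 )) || step=1    # keep every line if the file is short
    awk -v k="$step" 'NR % k == 0' "$file"
    ```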

    I hope this helps!
