Select random 3000 lines from a file with awk codes

北战南征 submitted on 2019-12-10 14:49:20

Question


I want to randomly select 3000 lines from sample.file, which contains 8000 lines. I would like to do this with awk or directly from the command line. How can I do that?


Answer 1:


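If you have GNU sort, it's easy (note that -R keeps identical lines next to each other, because it sorts by a random hash of each line):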

sort -R FILE | head -n3000

If you have GNU shuf, it's even easier:

shuf -n3000 FILE



Answer 2:


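awk 'BEGIN { srand() }
{ a[NR] = $0 }
END {
  # Partial Fisher-Yates shuffle, so that no line is printed twice.
  for (i = 1; i <= 3000 && i <= NR; i++) { x = i + int(rand() * (NR - i + 1)); t = a[x]; a[x] = a[i]; a[i] = t; print t }
}' yourFile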



Answer 3:


Fixed as per Glenn's comment:

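awk 'BEGIN {
  a = 8000; l = 3000         # a: lines in the file, l: lines to select
  srand(); nr[x]             # nr[x] (x is unset) adds one dummy element with an empty key
  while (length(nr) <= l)    # stop at l chosen line numbers plus the dummy
    nr[int(rand() * a) + 1]  # referencing an element creates it
  }
NR in nr                     # print the lines whose number was chosen
  ' infile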

P.S. Passing an array to the length built-in function is not portable, you've been warned :)
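
If portability matters, one way around length(nr) is to count the distinct line numbers yourself. The following is only a sketch of that idea and not part of the original answer: the function name pick_lines and its three arguments are made up for illustration, and the count is capped at the total so the loop always ends.

```shell
# pick_lines TOTAL COUNT FILE  (hypothetical helper, for illustration only)
pick_lines() {
  awk -v a="$1" -v l="$2" '
    BEGIN {
      srand()
      if (l > a) l = a              # cannot pick more lines than exist
      while (n < l) {               # n counts the distinct numbers so far
        k = int(rand() * a) + 1     # random line number in 1..a
        if (!(k in nr)) { nr[k]; n++ }
      }
    }
    NR in nr                        # print the chosen lines, in file order
  ' "$3"
}
```

For the file in the question this would be called as pick_lines 8000 3000 infile.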




Answer 4:


You can use a combination of awk, sort, head/tail and sed to do this, such as with:

pax$ seq 1 100 | awk '
...$    BEGIN {srand()}
...$          {print rand() " " $0}
...$ ' | sort | head -5 | sed 's/[^ ]* //'
57
25
80
51
72

which, as you can see, selects five random lines from the one hundred generated in seq 1 100.

The awk trick prefixes every line in the file with a random number and a space, in the format "0.237788 ", and sort then orders the lines by that random number.

Then you use head (or tail if you don't have a head) to get the first (or last) N lines.

Finally, the sed strips off the random number and space at the start of each line.

For your specific case, you could use something like:

awk 'BEGIN {srand()} {print rand() " " $0}' file8000.txt |
    sort |
    tail -n 3000 |
    sed 's/[^ ]* //' >file3000.txt
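
The decorate, sort, strip steps described above could also be wrapped in a small shell function. This is only a sketch under stated assumptions: the name sample_lines is made up here, and it departs from the answer in two details. The key is printed with a fixed number of decimals (awk's default number-to-string conversion, %.6g, keeps six significant digits and can switch to exponent notation for very small values), and a tab separates the key from the line so that cut can remove the key.

```shell
# sample_lines COUNT FILE  (hypothetical helper, for illustration only)
sample_lines() {
  # 1. prefix each line with a random key and a tab
  # 2. sort numerically on that key
  # 3. keep the first COUNT lines
  # 4. cut the key off again
  awk 'BEGIN { srand() } { printf "%.10f\t%s\n", rand(), $0 }' "$2" |
    sort -n |
    head -n "$1" |
    cut -f2-
}
```

With the names from the question, sample_lines 3000 file8000.txt >file3000.txt would produce the sample.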



Answer 5:


I used these commands, and got what I wanted:

awk 'BEGIN {srand()} {print rand() " " $0}' examples/data_text.txt | sort -n | tail -n 80 | awk '{printf "%1d %s %s\n",$2, $3, $4}' > examples/crossval.txt

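This randomly selects 80 lines from the input file. The final awk prints fields 2 to 4 of each prefixed line, which drops the random key in field 1; it assumes every input line has exactly three whitespace-separated fields and that the first of them is an integer.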




Answer 6:


In PowerShell:

Get-Content myfile | Get-Random -Count 3000

or shorter:

gc myfile | random -c 3000



Answer 7:


In case you only need approximately 3000 lines, this is an easy method:

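awk -v N="$(wc -l < FILE)" 'BEGIN {srand()} rand() < 3000/N' FILE

The command substitution $(wc -l < FILE) gives the number of lines in the file, so each line is kept with probability 3000/N. The srand() call seeds the generator; without it, many awk implementations (gawk, for example) select the same lines on every run.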
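
If you need exactly 3000 lines rather than approximately 3000, the same one-pass idea can be adjusted so that the probability depends on how many lines are still wanted and how many are still left. This is a different technique from the one in the answer (it is usually called selection sampling), shown here only as a sketch; the function name exact_sample is made up for illustration.

```shell
# exact_sample COUNT FILE  (hypothetical helper, for illustration only)
exact_sample() {
  awk -v n="$1" -v N="$(awk 'END { print NR }' "$2")" '
    BEGIN { srand() }
    # N - NR + 1 lines are left (including this one) and n are still wanted,
    # so keep this line with probability n / (N - NR + 1).
    rand() < n / (N - NR + 1) { print; n-- }
  ' "$2"
}
```

The file is read twice (once to count the lines, once to sample), nothing is held in memory, and the selected lines come out in their original order.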




Answer 8:


For a huge file that I didn't want to shuffle, this worked out well and pretty fast:

sed -u -n 'l1p;l2p; ... ;l1000p;l1000q'

The -u option (GNU sed) reduces buffering, the final l1000q makes sed quit right after the last selected line, and l1, l2, ... l1000 are random, sorted line numbers obtained from R (Python or Perl would do just as well).
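
The line numbers can also be generated in the shell instead of R. The sketch below is an assumption-laden variant, not the answer's own method: the function name sed_sample is made up, it relies on GNU shuf (the -i and -n options) to draw distinct numbers, it writes the sed commands to a temporary script file rather than passing them on the command line, and it leaves out -u.

```shell
# sed_sample COUNT FILE  (hypothetical helper; assumes GNU shuf is installed)
sed_sample() {
  total=$(( $(wc -l < "$2") ))      # number of lines in the file
  script=$(mktemp)
  # Distinct random line numbers, sorted, each turned into an "Np" command;
  # the largest number also gets "Nq" so sed stops reading there.
  shuf -i "1-$total" -n "$1" | sort -n |
    awk '{ print $1 "p"; last = $1 } END { if (NR) print last "q" }' > "$script"
  sed -n -f "$script" "$2"
  rm -f "$script"
}
```

Called as sed_sample 1000 hugefile, this prints 1000 lines in file order.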



Source: https://stackoverflow.com/questions/7514896/select-random-3000-lines-from-a-file-with-awk-codes
