Randomly Pick Lines From a File Without Slurping It With Unix

忘了有多久 2020-12-07 11:40

I have a file with 10^7 lines, from which I want to randomly choose 1/100 of the lines. This is the AWK code I have, but it slurps all the file content beforehand. My PC memory cannot handle such slurps. Is there another method to do it?

10 Answers
  • 2020-12-07 11:44

    In this case, reservoir sampling to get exactly k values is trivial enough with awk that I'm surprised no one has suggested it yet. I had to solve the same problem, and I wrote the following awk program for sampling:

    BEGIN { srand(); }
    NR <= k {
        # Fill the reservoir with the first k lines.
        reservoir[NR] = $0;
        next;
    }
    {
        # For every later line, replace a random reservoir slot with
        # probability k/NR (Algorithm R).
        i = int(NR * rand()) + 1;
        if (i <= k) {
            reservoir[i] = $0;
        }
    }
    END {
        for (i = 1; i <= k; i++) {
            print reservoir[i];
        }
    }
    

    Then figuring out what k is has to be done separately. Beware that awk -v assigns literal strings, so something like -v 'k=int(...)' is never evaluated as an expression; compute the integer in the shell and pass it in, as sketched below.
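    A minimal invocation sketch, assuming the program above is saved as reservoir.awk (a filename of my choosing):

    $ k=$(( $(wc -l < FILE) / 100 ))   # k = 1% of the line count, rounded down
    $ awk -v k="$k" -f reservoir.awk FILE
    

    Note that this reads the file twice (once for wc -l, once for awk), but only k lines are ever held in memory.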

  • 2020-12-07 11:46

    The problem of how to uniformly sample N elements out of a large population (of unknown size) is known as Reservoir Sampling. (If you like algorithms problems, do spend a few minutes trying to solve it without reading the algorithm on Wikipedia.) The key invariant is that after n >= N elements have been read, each of them is in the reservoir with equal probability N/n.

    A web search for "Reservoir Sampling" will find a lot of implementations. Here is Perl and Python code that implements what you want, and here is another Stack Overflow thread discussing it.

  • 2020-12-07 11:46

    Instead of waiting until the end to randomly pick your 1% of lines, do it every 100 lines as you read them. That way, you only hold 100 lines at a time; see the sketch below.
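    A minimal awk sketch of that idea (my own illustration, not the answerer's code): fill a 100-line buffer and print one line from each full block:

    awk 'BEGIN { srand() }
         { buf[NR % 100] = $0 }                          # rolling 100-line buffer
         NR % 100 == 0 { print buf[int(100 * rand())] }  # emit one line per block
        ' FILE
    

    Note this draws exactly one line per 100-line block (and silently drops a final partial block), which is close to, but not identical to, keeping each line independently with probability 1%.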

  • 2020-12-07 11:47

    If you have that many lines, are you sure you want exactly 1%, or would a statistical estimate be enough?

    In the second case, just keep each line with probability 1% (srand() with no argument seeds awk's generator from the time of day):

    awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'
    

    If you'd like the header line plus a random sample of the lines after it, use:

    awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'
    
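    A quick sanity check of the sample rate (my own illustration): count the sampled lines and compare with 1% of the input.

    $ awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' FILE | wc -l
    

    The count should be close to 1% of the non-blank line count, with binomial fluctuation around it.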
  • 2020-12-07 11:50

    This should work on almost any GNU/Linux machine.

    $ shuf -n $(( $(wc -l < "$file") / 100 )) "$file"
    

    I'd be surprised if the GNU shuf command managed memory inappropriately.

  • 2020-12-07 11:51

    I don't know awk, but there is a great technique for solving a more general version of the problem you've described, and in the general case it is quite a lot faster than the "for each line, keep it if rand() < 0.01" approach, so it can be useful if you intend to do tasks like this many (thousands, millions) of times. It is known as reservoir sampling, and this page has a pretty good explanation of a version of it that is applicable to your situation.
