I have a file with 10^7 lines, and I want to choose 1/100 of its lines at random. This is the AWK code I have, but it slurps all the file content beforehand. My PC
In this case, reservoir sampling to get exactly k values is easy enough with awk that I'm surprised no answer has suggested it yet. I had to solve the same problem, and I wrote the following awk
program for sampling:
BEGIN {
    srand();
}
# Fill the reservoir with the first k lines (slots 0 .. k-1).
NR <= k {
    reservoir[NR - 1] = $0;
    next;
}
# Each later line replaces a random slot with probability k/NR.
{
    i = int(NR * rand());
    if (i < k) {
        reservoir[i] = $0;
    }
}
END {
    for (i = 0; i < k; i++) {
        print reservoir[i];
    }
}
Then figuring out what k is has to be done separately. Note that -v assigns a plain string, not an awk expression, so wrapping the value in int(...) will not be evaluated; compute the number in the shell instead (assuming the program above is saved as sample.awk):
awk -v "k=$(( $(wc -l < FILE) / 100 ))" -f sample.awk FILE
The problem of how to uniformly sample N elements out of a large population (of unknown size) is solved by a technique known as Reservoir Sampling. (If you like algorithm problems, do spend a few minutes trying to solve it without reading the algorithm on Wikipedia.)
A web search for "Reservoir Sampling" will find a lot of implementations. Here is Perl and Python code that implements what you want, and here is another Stack Overflow thread discussing it.
Instead of waiting until the end to randomly pick your 1% of lines, do the picking once every 100 lines. That way, you only hold 100 lines in memory at a time.
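A minimal sketch of that idea (keeping exactly one random line out of every block of 100; any partial block at the end of the file is dropped):
awk 'BEGIN { srand() }
{ block[NR % 100] = $0 }                          # buffer the current block of 100 lines
NR % 100 == 0 { print block[int(100 * rand())] }  # block full: emit one line at random
' FILE
Note that this is not quite the same distribution as a uniform 1% sample, since it can never pick two lines from the same block of 100, but for many purposes it is close enough.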
If you have that many lines, are you sure you want exactly 1%, or would a statistical estimate be enough?
In that case, just keep each line with probability 1%:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'
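One caveat: srand() with no argument seeds from the time of day, so two runs started within the same second will produce the identical sample. If that matters, feed a seed in from the shell, e.g. awk -v seed="$RANDOM" 'BEGIN { srand(seed) } ...' in shells that provide $RANDOM.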
If you'd like the header line plus a random sample of lines after, use:
awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'
This should work on most any GNU/Linux machine:
$ shuf -n $(( $(wc -l < "$file") / 100 )) "$file"
I'd be surprised if memory management was done inappropriately by the GNU shuf command.
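One thing to be aware of: shuf emits the sampled lines in shuffled order. If you want to keep the sample in the file's original order, one option (a sketch using only standard GNU coreutils) is to number the lines before sampling, then sort and strip the numbers afterwards:
$ nl -ba "$file" | shuf -n $(( $(wc -l < "$file") / 100 )) | sort -n | cut -f2-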
I don't know awk, but there is a great technique for solving a more general version of the problem you've described, and in the general case it is quite a lot faster than the "for each line, keep it if rand() < 0.01" approach, so it can be useful if you intend to do tasks like this many (thousands, millions of) times. It is known as reservoir sampling, and this page has a pretty good explanation of a version of it that is applicable to your situation.