Randomly Pick Lines From a File Without Slurping It With Unix

忘了有多久 2020-12-07 11:40

I have a file with 10^7 lines, from which I want to choose 1/100 of the lines at random. This is the AWK code I have, but it slurps all the file content beforehand. My PC

10 Answers
  • 2020-12-07 11:53

    You used awk, but I don't know if it's required. If it's not, here's a trivial way to do it with Perl (and without loading the entire file into memory):

    cat your_file.txt | perl -n -e 'print if (rand() < .01)'
    

    (simpler form, from comments):

    perl -ne 'print if (rand() < .01)' your_file.txt 
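
    A minimal Python sketch of the same Bernoulli-sampling idea, in case neither awk nor Perl is a hard requirement (the filename is just a placeholder):

    import random
    import sys

    # Keep each line independently with probability 0.01 (about 1 in 100),
    # reading one line at a time so the whole file is never held in memory.
    with open("your_file.txt") as f:
        for line in f:
            if random.random() < 0.01:
                sys.stdout.write(line)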
    
  • 2020-12-07 12:01

    I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.

    This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)

    Original thread: Submitted for your review -- random sampling in awk.

    # Waterman's Algorithm R for random sampling
    # by way of Knuth's The Art of Computer Programming, volume 2

    BEGIN {
        if (!n) {
            print "Usage: sample.awk -v n=[size]"
            exit
        }
        t = n
        srand()
    }

    NR <= n {
        pool[NR] = $0
        places[NR] = NR
        next
    }

    NR > n {
        t++
        M = int(rand()*t) + 1
        if (M <= n) {
            READ_NEXT_RECORD(M)
        }
    }

    END {
        if (NR < n) {
            print "sample.awk: Not enough records for sample" \
                > "/dev/stderr"
            exit
        }
        # gawk needs a numeric sort function
        # since it doesn't have one, zero-pad and sort alphabetically
        pad = length(NR)
        for (i in pool) {
            new_index = sprintf("%0" pad "d", i)
            newpool[new_index] = pool[i]
        }
        x = asorti(newpool, ordered)
        for (i = 1; i <= x; i++)
            print newpool[ordered[i]]
    }

    function READ_NEXT_RECORD(idx) {
        rec = places[idx]
        delete pool[rec]
        pool[NR] = $0
        places[idx] = NR
    }
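
    For comparison, a minimal Python sketch of the same reservoir-sampling idea (Algorithm R). Unlike the gawk version above it does not preserve input order, and the sample size of 100 below is only an example:

    import random
    import sys

    def reservoir_sample(lines, n):
        # Keep the first n lines, then replace a random slot
        # with probability n/t for the t-th line.
        pool = []
        for t, line in enumerate(lines, start=1):
            if t <= n:
                pool.append(line)
            else:
                m = random.randrange(t)
                if m < n:
                    pool[m] = line
        return pool

    if __name__ == "__main__":
        for line in reservoir_sample(sys.stdin, 100):
            sys.stdout.write(line)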
    
  • 2020-12-07 12:01

    If the aim is just to avoid memory exhaustion, and the file is a regular file, there is no need to implement reservoir sampling. The number of lines in the file can be known by making two passes over the file: one to count the lines (for instance with wc -l), and one to select the sample:

    file=/some/file
    awk -v percent=0.01 -v n="$(wc -l < "$file")" '
      BEGIN {srand(); p = int(n * percent)}
      rand() * n-- < p {p--; print}' < "$file"
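
    The same selection-sampling trick as a Python sketch (it prints exactly p of the n lines, with p and n as in the awk above); the path is the same placeholder and the line count still comes from a first pass:

    import random
    import sys

    def select_sample(lines, n, p):
        # n: lines remaining, p: samples still needed.
        # Each line is kept with probability p/n, which yields exactly p lines.
        for line in lines:
            if random.random() * n < p:
                sys.stdout.write(line)
                p -= 1
            n -= 1

    path = "/some/file"
    with open(path) as f:
        total = sum(1 for _ in f)              # first pass: count lines
    with open(path) as f:
        select_sample(f, total, int(total * 0.01))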
    
  • 2020-12-07 12:09

    You could do it in two passes:

    • Run through the file once, just to count how many lines there are
    • Randomly select the line numbers of the lines you want to print, storing them in a sorted list (or a set)
    • Run through the file once more and pick out the lines at the selected positions

    Example in Python:

    fn = '/usr/share/dict/words'

    from random import randint
    from sys import stdout

    # First pass: count the lines
    count = 0
    with open(fn) as f:
        for line in f:
            count += 1

    # Pick 1% of the line numbers, without duplicates
    selected = set()
    while len(selected) < count//100:
        selected.add(randint(0, count-1))

    # Second pass: print the lines at the selected positions
    index = 0
    with open(fn) as f:
        for line in f:
            if index in selected:
                stdout.write(line)
            index += 1
    