Randomly Pick Lines From a File Without Slurping It With Unix

忘了有多久 2020-12-07 11:40

I have a file with 10^7 lines, from which I want to choose 1/100 of the lines at random. This is the AWK code I have, but it slurps all the file content beforehand. My PC

10 Answers
  • 2020-12-07 11:53

    You used awk, but I don't know if it's required. If it's not, here's a trivial way to do it with Perl (and without loading the entire file into memory):

    cat your_file.txt | perl -n -e 'print if (rand() < .01)'
    

    (simpler form, from comments):

    perl -ne 'print if (rand() < .01)' your_file.txt 
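
    A minimal Python sketch of the same Bernoulli-sampling idea, in case neither awk nor Perl is a hard requirement (the filename is just a placeholder):

    import random
    import sys

    # Keep each line independently with probability 0.01 (about 1 in 100),
    # reading one line at a time so the whole file is never held in memory.
    with open("your_file.txt") as f:
        for line in f:
            if random.random() < 0.01:
                sys.stdout.write(line)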
    
  • 2020-12-07 12:01

    I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.

    This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version because it does unnecessary comparisons.)

    Original thread: Submitted for your review -- random sampling in awk.

    # Waterman's Algorithm R for random sampling
    # by way of Knuth's The Art of Computer Programming, volume 2

    BEGIN {
        if (!n) {
            print "Usage: sample.awk -v n=[size]"
            exit
        }
        t = n
        srand()
    }

    NR <= n {
        pool[NR] = $0
        places[NR] = NR
        next
    }

    NR > n {
        t++
        M = int(rand()*t) + 1
        if (M <= n) {
            READ_NEXT_RECORD(M)
        }
    }

    END {
        if (NR < n) {
            print "sample.awk: Not enough records for sample" \
                > "/dev/stderr"
            exit
        }
        # gawk needs a numeric sort function
        # since it doesn't have one, zero-pad and sort alphabetically
        pad = length(NR)
        for (i in pool) {
            new_index = sprintf("%0" pad "d", i)
            newpool[new_index] = pool[i]
        }
        x = asorti(newpool, ordered)
        for (i = 1; i <= x; i++)
            print newpool[ordered[i]]
    }

    function READ_NEXT_RECORD(idx) {
        rec = places[idx]
        delete pool[rec]
        pool[NR] = $0
        places[idx] = NR
    }
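
    For comparison, a minimal Python sketch of the same reservoir-sampling idea (Algorithm R). Unlike the gawk version above it does not preserve input order, and the sample size of 100 below is only an example:

    import random
    import sys

    def reservoir_sample(lines, n):
        # Keep the first n lines, then replace a random slot
        # with probability n/t for the t-th line.
        pool = []
        for t, line in enumerate(lines, start=1):
            if t <= n:
                pool.append(line)
            else:
                m = random.randrange(t)
                if m < n:
                    pool[m] = line
        return pool

    if __name__ == "__main__":
        for line in reservoir_sample(sys.stdin, 100):
            sys.stdout.write(line)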
    
  • 2020-12-07 12:01

    If the aim is just to avoid memory exhaustion, and the file is a regular file, there is no need to implement reservoir sampling. The number of lines in the file can be known by making two passes over the file: one to count the lines (for instance with wc -l), and one to select the sample:

    file=/some/file
    awk -v percent=0.01 -v n="$(wc -l < "$file")" '
      BEGIN {srand(); p = int(n * percent)}
      rand() * n-- < p {p--; print}' < "$file"
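
    The same selection-sampling trick as a Python sketch (it prints exactly p of the n lines, with p and n as in the awk above); the path is the same placeholder and the line count still comes from a first pass:

    import random
    import sys

    def select_sample(lines, n, p):
        # n: lines remaining, p: samples still needed.
        # Each line is kept with probability p/n, which yields exactly p lines.
        for line in lines:
            if random.random() * n < p:
                sys.stdout.write(line)
                p -= 1
            n -= 1

    path = "/some/file"
    with open(path) as f:
        total = sum(1 for _ in f)              # first pass: count lines
    with open(path) as f:
        select_sample(f, total, int(total * 0.01))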
    
  • 2020-12-07 12:09

    You could do it in two passes:

    • Run through the file once, just to count how many lines there are
    • Randomly select the line numbers of the lines you want to print, storing them in a sorted list (or a set)
    • Run through the file once more and pick out the lines at the selected positions

    Example in Python:

    fn = '/usr/share/dict/words'

    from random import randint
    from sys import stdout

    # First pass: count the lines
    count = 0
    with open(fn) as f:
        for line in f:
            count += 1

    # Pick 1% of the line numbers, without duplicates
    selected = set()
    while len(selected) < count//100:
        selected.add(randint(0, count-1))

    # Second pass: print the lines at the selected positions
    index = 0
    with open(fn) as f:
        for line in f:
            if index in selected:
                stdout.write(line)
            index += 1
    