Randomly selecting lines from files

后端 未结 7 680
轮回少年
轮回少年 2021-01-13 14:21

I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files. How can

相关标签:
7条回答
  • 2021-01-13 14:56

    Another Python option; reading the contents of all files into memory:

    import random
    import fileinput
    
    def openhook(filename, mode):
        f = open(filename, mode)
        headers = [f.readline() for _ in range(5)]
        return f
    
    num_entries = 3
    lines = list(fileinput.input(openhook=openhook))
    print random.sample(lines, num_entries)
    
    0 讨论(0)
  • 2021-01-13 14:57
    sed "1,5d" < FILENAME | sort -R | head -2
    
    0 讨论(0)
  • 2021-01-13 15:04

    Answer is in Python. Assuming you can read a whole file into memory.

    #using python 2.6
    import sys
    import os
    import itertools
    import random
    
    def main(directory, num_files=5, num_entries=5):
        file_paths = os.listdir(directory)
    
        # get a random sampling of the available paths
        chosen_paths = random.sample(file_paths, num_files)
    
        for path in chosen_paths:
            chosen_entries = get_random_entries(path, num_entries)
            for entry in chosen_entries:
                # do something with your chosen entries
                print entry
    
    def get_random_entries(file_path, num_entries):
        with open(file_path, 'r') as file:
            # read the lines and slice off the headers
            lines = file.readlines()[5:]
    
            # group the lines into pairs (i.e. entries)
            entries = list(itertools.izip_longest(*[iter(lines)]*2))
    
            # return a random sampling of entries
            return random.sample(entries, num_entries)
    
    if __name__ == '__main__':
        #use optparse here to do fancy things with the command line args
        main(sys.argv[1:])
    

    Links: itertools, random, optparse

    0 讨论(0)
  • 2021-01-13 15:07

    Python solution - reads file only once and requires little memory

    Invoke like so getRandomItems(file('myHuge.log'), 5, 2) - will return list of 2 lines

    from random import randrange
    
    def getRandomItems(f, skipFirst=0, numItems=1):
        for _ in xrange(skipFirst):
            f.next()
        n = 0; r = []
        while True:
            try:
                nxt = [f.next() for _ in range(numItems)]
            except StopIteration: break
            n += 1
            if not randrange(n):
                r = nxt
        return r
    

    Returns empty list if it could not get the first passable items from f. The code's only requirement is that argument f is an iterator (supports next() method). Hence we can pass something different than file, say we want to see the distribution:

    >>> s={}
    >>> for i in xrange(5000):
    ...     r = getRandomItems(iter(xrange(50)))[0]
    ...     s[r] = 1 + s.get(r,0)
    ... 
    >>> for i in s: 
    ...     print i, '*' * s[i]
    ... 
    0 ***********************************************************************************************
    1 **************************************************************************************************************
    2 ******************************************************************************************************
    3 ***************************************************************************
    4 *************************************************************************************************************************
    5 ********************************************************************************
    6 **********************************************************************************************
    7 ***************************************************************************************
    8 ********************************************************************************************
    9 ********************************************************************************************
    10 ***********************************************************************************************
    11 ************************************************************************************************
    12 *******************************************************************************************************************
    13 *************************************************************************************************************
    14 ***************************************************************************************************************
    15 *****************************************************************************************************
    16 ********************************************************************************************************
    17 ****************************************************************************************************
    18 ************************************************************************************************
    19 **********************************************************************************
    20 ******************************************************************************************
    21 ********************************************************************************************************
    22 ******************************************************************************************************
    23 **********************************************************************************************************
    24 *******************************************************************************************************
    25 ******************************************************************************************
    26 ***************************************************************************************************************
    27 ***********************************************************************************************************
    28 *****************************************************************************************************
    29 ****************************************************************************************************************
    30 ********************************************************************************************************
    31 ********************************************************************************************
    32 ****************************************************************************************************
    33 **********************************************************************************************
    34 ****************************************************************************************************
    35 **************************************************************************************************
    36 *********************************************************************************************
    37 ***************************************************************************************
    38 *******************************************************************************************************
    39 **********************************************************************************************************
    40 ******************************************************************************************************
    41 ********************************************************************************************************
    42 ************************************************************************************
    43 ****************************************************************************************************************************
    44 ****************************************************************************************************************************
    45 ***********************************************************************************************
    46 *****************************************************************************************************
    47 ***************************************************************************************
    48 ***********************************************************************************************************
    49 ****************************************************************************************************************
    
    0 讨论(0)
  • 2021-01-13 15:11

    If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.

    Here's an intuitive explanation for the algorithm.

    Process the file line by line.
    pick = line, with probability 1/N, where N = line number
    

    In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.

    For an intuitive proof, imagine a file with 3 lines:

            1            Pick line 1.
           / \
         .5  .5
         /     \
        2       1        Switch to line 2?
       / \     / \
     .67 .33 .33 .67
     /     \ /     \
    2       3       1    Switch to line 3?
    

    The probability for each outcome:

    Line 1: .5 * .67     = 1/3
    Line 2: .5 * .67     = 1/3
    Line 3: .5 * .33 * 2 = 1/3
    

    From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).

    Here's the Perl FAQ answer, adapted to your problem.

    use strict;
    use warnings;
    
    # Ignore 5 lines.
    <> for 1 .. 5;
    
    # Use reservoir sampling to select pairs from remaining lines.
    my (@picks, $n);
    until (eof){
        my @lines;
        $lines[$_] = <> for 0 .. 1;
    
        $n ++;
        @picks = @lines if rand($n) < 1;
    }
    
    print @picks;
    
    0 讨论(0)
  • 2021-01-13 15:11

    You may find perlfaq5 useful.

    0 讨论(0)
提交回复
热议问题