Read random lines from huge CSV file in Python

独厮守ぢ 2020-12-05 02:27

I have this quite big CSV file (15 GB) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows iterating sequentially through the file.

11 Answers
  • 2020-12-05 03:05

    You can use a variation of the probabilistic method for choosing a random line in a file.

    Instead of keeping just a single chosen line, keep a buffer of size C. For each line number n in a file with N lines, choose that line with probability C/n (rather than the original 1/n). If the line is selected, choose a random location in the C-length buffer to evict and store the new line there.

    Here's how it works:

    import random
    
    C = 2                       # size of the random sample to keep
    fpath = 'somelines.txt'
    buffer = []                 # reservoir of C selected lines
    
    with open(fpath, 'r') as f:
        for line_num, line in enumerate(f):
            n = line_num + 1.0              # 1-based line count, float for the division
            r = random.random()
            if n <= C:
                # fill the buffer with the first C lines
                buffer.append(line.strip())
            elif r < C / n:
                # keep this line with probability C/n, evicting a random buffer slot
                loc = random.randint(0, C - 1)
                buffer[loc] = line.strip()
    

    This requires a single pass through the file (so it's linear time) and returns exactly C lines from the file. Each line will have probability C/N of being selected.

    To verify that the above works, I created a file with 5 lines containing a, b, c, d, e and ran the code 10,000 times with C=2. This should produce an approximately even distribution over the 5-choose-2 (so 10) possible pairs (a sketch of such a check appears after the results below). The results:

    a,b: 1046
    b,c: 1018
    b,e: 1014
    a,c: 1003
    c,d: 1002
    d,e: 1000
    c,e: 993
    a,e: 992
    a,d: 985
    b,d: 947
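
    A minimal sketch of that kind of check, assuming the reservoir loop above is wrapped in a hypothetical sample_lines(fpath, C) helper:

    import random
    from collections import Counter
    
    def sample_lines(fpath, C):
        # same buffered selection as in the loop above
        buffer = []
        with open(fpath) as f:
            for line_num, line in enumerate(f):
                n = line_num + 1.0
                if n <= C:
                    buffer.append(line.strip())
                elif random.random() < C / n:
                    buffer[random.randint(0, C - 1)] = line.strip()
        return tuple(sorted(buffer))
    
    # tally which pair of lines gets chosen over 10,000 runs with C=2
    counts = Counter(sample_lines('somelines.txt', 2) for _ in range(10000))
    for pair, count in counts.most_common():
        print(pair, count)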
    
  • 2020-12-05 03:06

    I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it

    Assuming you don't need exactly 1 million lines and know the number of lines in your CSV file beforehand, you can use reservoir sampling to retrieve your random subset. Simply iterate through your data and, for each line, determine the chances of the line being selected. That way you only need a single pass over your data.

    This works well if you need to extract the random samples often but the actual dataset changes infrequently (since you'll only need to keep track of the number of entries each time the dataset changes).

    import csv
    from random import random
    
    chances_selected = desired_num_results / float(total_entries)
    result = []
    for line in csv.reader(file):
        if random() < chances_selected:
            result.append(line)
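
    A minimal usage sketch under the same idea, assuming the line count comes from a quick counting pass that can be cached between runs (the file path 'huge.csv' is a placeholder, not from the answer):

    import csv
    from random import random
    
    desired_num_results = 1000000
    with open('huge.csv', newline='') as f:        # placeholder path
        total_entries = sum(1 for _ in f)          # counting pass; cache this value
    
    chances_selected = desired_num_results / float(total_entries)
    result = []
    with open('huge.csv', newline='') as f:
        for line in csv.reader(f):                 # select each row independently
            if random() < chances_selected:
                result.append(line)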
    
  • 2020-12-05 03:06

    Another solution is possible if you know the total number of lines - generate a set of 1 million random line numbers up to the total number of lines (random.sample(xrange(n), 1000000), or range(n) in Python 3), then use:

    def grab_lines(csvfile, lines_to_grab):
        # lines_to_grab: the set of sampled 0-based line numbers
        for i, line in enumerate(csvfile):
            if i in lines_to_grab:
                yield line
    

    This will get you exactly 1 million lines in an unbiased way, but you need to have the number of lines beforehand.
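
    A minimal usage sketch using the grab_lines generator above (the line count n and the file path are illustrative placeholders):

    import csv
    import random
    
    n = 20000000                                   # total number of lines, assumed known
    lines_to_grab = set(random.sample(range(n), 1000000))
    
    with open('huge.csv', newline='') as csvfile:  # placeholder path
        sampled_rows = list(csv.reader(grab_lines(csvfile, lines_to_grab)))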

  • You can rewrite the file with fixed-length records, and then perform random access on the intermediate file later:

    ifile = open("inputfile.csv")
    ofile = open("intermediatefile.csv", 'w')
    for line in ifile:
        # pad every line to a fixed 15-character record (16 bytes with the newline)
        ofile.write(line.rstrip('\n').ljust(15) + '\n')
    ifile.close()
    ofile.close()
    

    Then, you can do:

    import random
    
    record_len = 16                                     # 15 padded chars + newline
    ifile = open("intermediatefile.csv")
    lines = []
    samples = random.sample(range(nlines), nsamples)    # nlines / nsamples: total records and sample size (placeholders)
    for sample in samples:
        ifile.seek(sample * record_len)                 # jump straight to the chosen record
        lines.append(ifile.readline().rstrip())         # drop the padding and newline
    ifile.close()
    

    Requires more disk space, and the first program may take some time to run, but it allows unlimited later random access to records with the second.
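
    A minimal sketch of how nlines could be obtained without another full pass, assuming the 16-byte fixed-width records written above (the file name is the one used in the snippets):

    import os
    
    record_len = 16                                     # 15 padded chars + newline
    nlines = os.path.getsize("intermediatefile.csv") // record_len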

  • 2020-12-05 03:07

    In this method, we generate a set of random numbers whose size equals the number of lines to be read, drawn from the range of row indices present in the data. The numbers are then sorted from smallest to largest and stored.
    
    The CSV file is then read line by line, with a line_counter tracking the current row number. The line_counter is compared against the first element of the sorted list; if they match, that row is written to the new CSV file, the first element is removed from the list (so the former second element becomes the new first), and the cycle continues.

    import csv
    import random
    
    # pick which row indices to keep, then sort them in ascending order
    k = random.sample(range(No_of_rows_in_data), No_of_lines_to_be_read)
    Num = sorted(k)
    line_counter = 0
    
    with open(input_file, 'r', newline='') as file_handle:
        reader = csv.reader(file_handle)
        with open(output_file, 'w', newline='') as outfile:
            a = csv.writer(outfile)
            for line in reader:
                if line_counter == Num[0]:
                    a.writerow(line)
                    Num.remove(Num[0])
                    if len(Num) == 0:
                        break
                line_counter += 1
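
    For completeness, a minimal setup sketch with illustrative placeholder values for the names used above (none of these values come from the answer):

    No_of_rows_in_data = 20000000          # total rows, assumed known beforehand
    No_of_lines_to_be_read = 1000000
    input_file = 'huge.csv'                # placeholder file paths
    output_file = 'random_sample.csv'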
    