Read random lines from huge CSV file in Python

独厮守ぢ 2020-12-05 02:27

I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows t

11 Answers
  • 2020-12-05 02:49

    If you want to grab random lines many times (e.g., mini-batches for machine learning), and you don't mind scanning through the huge file once (without loading it into memory), then you can build a list of the byte offsets at which each line starts and use seek to quickly grab the lines (based on Maria Zverina's answer).

    import random

    # Overhead:
    # Read the line locations into memory once.  (If the lines are long,
    # this should take substantially less memory than the file itself.)
    fname = 'big_file'
    linelocs = []
    offset = 0
    with open(fname, 'rb') as f:  # binary mode, so len(line) counts bytes
        for line in f:
            linelocs.append(offset)  # byte offset where this line starts
            offset += len(line)
    f = open(fname, 'rb')  # Reopen the file for seeking.

    # Each subsequent iteration uses only the code below:
    # Grab a 1,000,000 line sample
    # I sorted these because I assume the seeks are faster that way.
    chosen = sorted(random.sample(linelocs, 1000000))
    sampleLines = []
    for offset in chosen:
        f.seek(offset)
        sampleLines.append(f.readline())  # bytes; decode if you need str
    # Now we can randomize if need be.
    random.shuffle(sampleLines)
    
  • 2020-12-05 02:49
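    import csv
    import random

    # 'big_file.csv' is a placeholder name for the 15 GB file
    with open('big_file.csv', newline='') as file:
        # pass 1, count the number of rows in the file
        # (counted with csv.reader so quoted fields spanning lines count once)
        rowcount = sum(1 for row in csv.reader(file))
        # pass 2, select random lines
        file.seek(0)
        remaining = 1000000
        for row in csv.reader(file):
            # keep this row with probability remaining / rowcount
            if random.randrange(rowcount) < remaining:
                print(row)
                remaining -= 1
            rowcount -= 1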
    
  • 2020-12-05 02:50

    If you can place this data in a sqlite3 database, selecting some number of random rows is trivial. You will not need to pre-read or pad lines in the file. Since SQLite data files are binary, your data file may also be 1/3 to 1/2 smaller than the CSV text.

    You can use a script like THIS to import the CSV file or, better still, just write your data to a database table in the first place. The sqlite3 module is part of the Python standard library.
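
    In case the linked import script is not at hand, here is a minimal sketch of such an import using only the standard library. The function name `import_csv` and its parameters are made up for illustration, and it assumes the CSV has a header row and that every row has the same number of fields; all columns are stored as TEXT.

```python
import csv
import sqlite3


def import_csv(csv_path, db_path, table="csv"):
    """Load a CSV file with a header row into a SQLite table of TEXT columns.

    Hypothetical helper for illustration; assumes every row has as many
    fields as the header.
    """
    con = sqlite3.connect(db_path)
    # "with con" commits the transaction on success and rolls back on error.
    with open(csv_path, newline="") as f, con:
        reader = csv.reader(f)
        header = next(reader)
        # Quote column names so headers containing spaces still work.
        columns = ", ".join(
            '"{}" TEXT'.format(name.replace('"', '""')) for name in header
        )
        con.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, columns))
        placeholders = ", ".join("?" * len(header))
        # executemany consumes the reader lazily, so the file is streamed
        # rather than loaded into memory.
        con.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader
        )
    con.close()
```

    With the default table name, `import_csv('data.csv', 'csv.db')` would produce a `csv` table that a `SELECT ... ORDER BY RANDOM()` query can read from.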

    Then use these statements to get 1,000,000 random rows:

    import sqlite3

    mydb = 'csv.db'
    con = sqlite3.connect(mydb)

    with con:
        cur = con.cursor()
        cur.execute("SELECT * FROM csv ORDER BY RANDOM() LIMIT 1000000;")

        for row in cur.fetchall():
            # now you have random rows...
            pass  # replace with your own processing
    
  • 2020-12-05 02:51

    If you can use pandas and numpy, I have posted a solution in another question that is pandas specific but very efficient:

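    import pandas as pd
    import numpy as np

    filename = "data.csv"
    sample_size = 1000000
    batch_size = 5000

    rng = np.random.default_rng()

    sample_reader = pd.read_csv(filename, dtype=str, chunksize=batch_size)

    # The first sample_size rows fill the reservoir.
    sample = sample_reader.get_chunk(sample_size)
    seen = len(sample)

    for chunk in sample_reader:
        # Reservoir sampling: the n-th row overall draws a slot in [0, n) and
        # replaces a sampled row only if that slot is below sample_size.
        # (Replacing a slot unconditionally would favour rows late in the file.)
        slots = rng.integers(0, np.arange(seen + 1, seen + len(chunk) + 1))
        seen += len(chunk)
        chunk.index = slots
        chunk = chunk[slots < sample_size]
        # If two rows drew the same slot, the later one wins.
        chunk = chunk[~chunk.index.duplicated(keep="last")]
        if not chunk.empty:
            sample.loc[chunk.index] = chunk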
    

    For more details, please see the other answer.
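
    A closely related single-pass technique is reservoir sampling (often called Algorithm R), which needs neither the row count nor pandas. The sketch below uses only the standard library; `reservoir_sample` and `sample_csv` are names invented here, and `sample_csv` assumes the file starts with a header row.

```python
import csv
import random


def reservoir_sample(rows, k, rng=random):
    """Return up to k items chosen uniformly at random from an iterable.

    Works in one pass without knowing the length in advance.
    """
    sample = []
    for n, row in enumerate(rows, start=1):
        if n <= k:
            # The first k rows fill the reservoir.
            sample.append(row)
        else:
            # The n-th row replaces a random slot with probability k / n.
            j = rng.randrange(n)
            if j < k:
                sample[j] = row
    return sample


def sample_csv(path, k):
    """Return (header, k random rows) from a CSV file with a header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, reservoir_sample(reader, k)
```

    This is slower per row than the chunked pandas version, but memory use is bounded by the sample size.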

    0 讨论(0)
  • If the lines are truly .csv format and NOT fixed field, then no, there's not. You can crawl through the file once, indexing the byte offsets for each line, then when later needed only use the index set, but there's no way to a priori predict the exact location of the line-terminating \n character for arbitrary csv files.
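
    The one-time indexing pass described above could look roughly like this. The helper names are invented for the sketch. It stores offsets in an `array` (8 bytes per line) to keep the index small, and it assumes no quoted field contains an embedded newline, since such a record would span several physical lines.

```python
from array import array


def build_line_index(path):
    """Scan the file once and return the starting byte offset of every line."""
    offsets = array("Q")  # unsigned 64-bit integers, 8 bytes per line
    position = 0
    with open(path, "rb") as f:  # binary mode, so len(line) counts bytes
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def read_lines_at(path, offsets):
    """Return the raw lines (bytes) starting at the given byte offsets.

    Offsets are visited in ascending order, so the result is in file order.
    """
    lines = []
    with open(path, "rb") as f:
        for offset in sorted(offsets):
            f.seek(offset)
            lines.append(f.readline())
    return lines
```

    To draw a sample, pick line numbers with `random.sample(range(len(index)), k)` and pass the corresponding `index[i]` values to `read_lines_at`.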

  • 2020-12-05 03:03
    import os
    import random

    fname = 'really_big_file'
    filesize = os.path.getsize(fname)  # size of the really big file, in bytes
    offset = random.randrange(filesize)

    f = open(fname, 'rb')           # binary mode, so seek() takes a byte offset
    f.seek(offset)                  # go to random position
    f.readline()                    # discard - bound to be partial line
    random_line = f.readline()      # bingo!

    # extra to handle last/first line edge cases
    if len(random_line) == 0:       # we have hit the end
        f.seek(0)
        random_line = f.readline()  # so we'll grab the first line instead
    

    As @AndreBoos pointed out, this approach will lead to biased selection. If you know min and max length of line you can remove this bias by doing the following:

    Let's assume (in this case) we have min=3 and max=15

    1) Find the length (Lp) of the previous line.

    If Lp = 3, the line is most biased against, so we should take it 100% of the time. If Lp = 15, the line is most biased towards, so we should only take it 20% of the time, as it is 5 times more likely to be selected.

    We accomplish this by randomly keeping the line X% of the time where:

    X = min / Lp

    If we don't keep the line, we do another random pick until our dice roll comes good. :-)
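
    Put together, the keep-or-retry procedure might be sketched as follows. The function name and parameters are invented for illustration. It assumes lengths are counted in bytes including the newline, that no line is longer than `max_len`, and that the file ends with a newline. `max_len` is only used to decide how far to look back for the start of the line we landed in.

```python
import os
import random


def unbiased_random_line(path, min_len, max_len, rng=random):
    """Pick one line uniformly at random by seek plus rejection sampling.

    min_len and max_len are the shortest and longest line lengths in bytes,
    newline included (assumed known in advance).
    """
    filesize = os.path.getsize(path)
    with open(path, "rb") as f:
        while True:
            offset = rng.randrange(filesize)
            # Look back up to max_len bytes to find where the line that
            # contains `offset` starts.
            back = min(offset, max_len)
            f.seek(offset - back)
            before = f.read(back)
            start = offset - back + before.rfind(b"\n") + 1
            # Read the rest of that line; Lp is its full length.
            rest = f.readline()
            prev_len = (offset - start) + len(rest)
            # The candidate is the line that follows, wrapping to the first
            # line when we landed in the last one.
            candidate = f.readline()
            if not candidate:
                f.seek(0)
                candidate = f.readline()
            # Keep with probability X = min / Lp, otherwise roll again.
            if rng.random() < min_len / prev_len:
                return candidate
```

    Each attempt returns any given line with probability min / filesize, so longer preceding lines no longer make a line more likely to be picked. The cost is extra retries when line lengths vary widely.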
