How to read specific lines of a large CSV file

被撕碎了的回忆 2021-01-02 10:21

I am trying to read some specific rows of a large CSV file, and I don't want to load the whole file into memory. The indices of the rows I need are given in a list L.

4 answers
  • 2021-01-02 10:59
    for row in enumerate(r):
    

    yields (index, row) tuples. You are then trying to select the ith element of a 2-element tuple.

    For example:

    >>> for i in enumerate({"a": 1, "b": 2}): print(i)
    (0, 'a')
    (1, 'b')
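
The same thing can be seen with a csv.DictReader directly. The following is only a small sketch: it uses an in-memory io.StringIO as a stand-in for the real file, and the column names and values are borrowed from the example row quoted elsewhere in this thread. It shows that each item coming out of enumerate is a 2-element tuple, and that unpacking it gives back the index and the row.

```python
import csv
import io

# In-memory stand-in for the CSV file (assumption: three columns, two data rows).
sample = io.StringIO("Col 1,Col 2,Col 3\nrow11,row12,row13\nrow21,row22,row23\n")

pairs = list(enumerate(csv.DictReader(sample)))

first = pairs[0]
print(type(first).__name__, len(first))   # each item is a tuple with two members

index, row = first                        # unpack instead of indexing the tuple
print(index, row["Col 2"])

try:
    first[9]                              # indexing the tuple itself, as row[i] did
except IndexError as exc:
    print("IndexError:", exc)
```

Writing the loop as `for i, row in enumerate(r):` avoids the problem, because `row` is then the dictionary for that line rather than the whole tuple.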
    

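    Additionally, before Python 3.7 dictionaries did not guarantee any particular order, so the order in which you wrote the keys was not necessarily preserved (from Python 3.7 on, insertion order is kept). For instance, on Python 2:

    >>> list({"a": 1, "b": 2, "c": 3, "d": 5})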
    ['a', 'c', 'b', 'd']
    
  • 2021-01-02 11:08

    Assuming L is a list containing the line numbers you want, you could do:

    import csv
    import os

    # open() does not expand "~", so expand it explicitly
    with open(os.path.expanduser("~/file.csv"), newline="") as f:
        r = csv.DictReader(f)
        for i, line in enumerate(r):
            if i in L:    # or (i+2) in L: from your second example
                print(line)
    

    That way:

    • you read the file only once
    • you do not load the whole file in memory
    • you only get the lines you are interested in

    The only caveat is that you read the whole file even if L = [3].
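
One way around that caveat is to bound the iteration at the largest requested index. The sketch below is an illustration under assumptions, not part of the answer above: `rows_at` is a hypothetical helper name, the indices are taken to be 0-based as with enumerate, and an in-memory io.StringIO with made-up data stands in for the file. It uses itertools.islice so that nothing after row max(L) is read.

```python
import csv
import io
from itertools import islice

def rows_at(reader, indices):
    """Yield (index, row) for the requested 0-based indices, reading no further than the largest."""
    wanted = set(indices)              # set membership checks are O(1)
    if not wanted:
        return
    # islice stops pulling rows from the reader after index max(wanted)
    for i, row in enumerate(islice(reader, max(wanted) + 1)):
        if i in wanted:
            yield i, row

# In-memory stand-in for the CSV file (assumption: made-up data).
sample = io.StringIO("Col 1,Col 2\nr0a,r0b\nr1a,r1b\nr2a,r2b\nr3a,r3b\nr4a,r4b\n")
selected = list(rows_at(csv.DictReader(sample), [3, 1]))
for i, row in selected:
    print(i, row)
```

Rows come back in file order regardless of the order of the indices, since the file is still read front to back.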

  • 2021-01-02 11:11

    Just to sum up the great ideas, I ended up using something like the code below. L can be sorted relatively quickly, and in my case it was already sorted. So instead of running a membership check against all of L for every row, it pays off to sort L and compare each row number only against its first entry. Here is my piece of code:

    import csv
    import os

    count = 0
    with open(os.path.expanduser('~/file.csv'), newline='') as f:
        r = csv.DictReader(f)
        for row in r:
            count += 1              # rows are counted from 1 here
            if not L:               # every requested row has been printed
                break
            elif count == L[0]:
                print(row)
                L.pop(0)            # note: this empties L as it goes
    

    Note that this stops as soon as we've gone through L once.
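
A variant of the same idea that leaves L untouched might look like the sketch below. It is only an illustration: `rows_in_order` is a hypothetical name, the positions are assumed to be positive and 1-based to match the count used above, and io.StringIO with made-up data stands in for the file. Instead of popping from the front of the list, it walks an iterator over a sorted copy.

```python
import csv
import io

def rows_in_order(reader, indices):
    """Yield rows whose 1-based position is in `indices`, stopping after the last one."""
    targets = iter(sorted(set(indices)))   # sorted copy; the caller's list is not modified
    target = next(targets, None)
    if target is None:                     # nothing requested
        return
    for count, row in enumerate(reader, start=1):
        if count == target:
            yield row
            target = next(targets, None)
            if target is None:             # all requested rows have been yielded
                return

# In-memory stand-in for the CSV file (assumption: made-up data).
sample = io.StringIO("name,value\na,1\nb,2\nc,3\nd,4\ne,5\n")
L = [4, 2]
picked = list(rows_in_order(csv.DictReader(sample), L))
print([row["name"] for row in picked])
```

Because only one comparison is made per row, the cost per row does not grow with the length of L.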

  • 2021-01-02 11:19

    A file doesn't have "lines" or "rows". What you consider a "line" is "what is found between two newline characters". As such you cannot read the nth line without reading the lines before it, as you couldn't count the newline characters.
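
If the same file has to be queried many times, the counting only needs to happen once: a single pass can record the byte offset at which each line starts, and later reads can seek straight there. The sketch below illustrates this under some assumptions: the helper names are hypothetical, the file is UTF-8, and no quoted CSV field contains a newline (otherwise a physical line is not a CSV row). A temporary file with made-up content stands in for the real one.

```python
import os
import tempfile

def build_line_offsets(path):
    """Scan the file once and return the byte offset at which each line starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for raw_line in f:
            offsets.append(position)
            position += len(raw_line)
    return offsets

def read_line_at(path, offsets, n):
    """Return line n (0-based) by seeking to its recorded offset."""
    with open(path, "rb") as f:
        f.seek(offsets[n])
        return f.readline().decode("utf-8").rstrip("\r\n")

# Throwaway file standing in for the real CSV (assumption: made-up data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("Col 1,Col 2\nrow11,row12\nrow21,row22\nrow31,row32\n")
    path = tmp.name

offsets = build_line_offsets(path)
third_line = read_line_at(path, offsets, 2)
print(third_line)
os.remove(path)
```

The first pass still reads every byte, which is exactly the point made above; the index only pays off for repeated lookups.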

    Answer 1: if you consider your example, but with L=[9], unrolling your loops would give:

    i=9
    row = (0, {'Col 2': 'row12', 'Col 3': 'row13', 'Col 1': 'row11'})
    

    As you can see, row is a tuple with two members, so row[i] means row[9], hence the IndexError.

    Answer 2: This is very slow because you are reading the file up to the line number every time. In your example, you read the first 2 lines, then the first 5, then the first 15, then the first 98, etc. So you've read the first 5 lines 3 times. You could create a generator that only returns the lines you want (beware, line numbers would be 0-indexed):

    def read_my_lines(csv_reader, lines_list):
        for line_number, row in enumerate(csv_reader):
            if line_number in lines_list:
                yield line_number, row
    

    So when you want to process the lines, you would do:

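    import csv
    import os

    L = [2, 5, 15, 98, ...]
    # open() does not expand "~", so expand it explicitly
    with open(os.path.expanduser('~/file.csv'), newline='') as f:
        r = csv.DictReader(f)
        for line_number, line in read_my_lines(r, L):
            do_something_with_line(line)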
    

    * Edit *

    This could further be improved to stop reading the file when you've read all the lines you wanted:

    def read_my_lines(csv_reader, lines_list):
        # make sure every line number shows up only once:
        lines_set = set(lines_list)
        for line_number, row in enumerate(csv_reader):
            if line_number in lines_set:
                yield line_number, row
                lines_set.remove(line_number)
                # Stop when the set is empty. Use return, not raise StopIteration:
                # since Python 3.7 that raise becomes a RuntimeError in a generator.
                if not lines_set:
                    return
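
To check that a generator like this really stops reading early, one can count how many lines are pulled from the input. The sketch below is only an illustration: `counting` is a hypothetical helper, io.StringIO with generated data stands in for the file, and the generator is repeated here (ending with a plain `return`, which is how a generator finishes) so that the block runs on its own.

```python
import csv
import io

def read_my_lines(csv_reader, lines_list):
    """Yield (line_number, row) for the requested 0-based rows, then stop."""
    lines_set = set(lines_list)
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_set:
            yield line_number, row
            lines_set.remove(line_number)
            if not lines_set:
                return                      # ends the generator cleanly

def counting(lines, counter):
    """Hypothetical helper: pass lines through while counting how many were pulled."""
    for line in lines:
        counter["n"] += 1
        yield line

# 1000 data rows after a one-line header (assumption: made-up data).
sample = io.StringIO("id\n" + "".join(f"{i}\n" for i in range(1000)))
counter = {"n": 0}
reader = csv.DictReader(counting(sample, counter))

found = [(n, row["id"]) for n, row in read_my_lines(reader, [2, 5])]
print(found)
print("lines pulled from the input:", counter["n"])
```

csv.DictReader accepts any iterable of lines, which is what makes the counting wrapper possible here.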
    