How to read specific lines of a large csv file

后端未结

关注

 4  1078

I am trying to read some specific rows of a large csv file, and I don\'t want to load the whole file into memory. The index of the specific rows are given in a list L


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  北恋        
                
              
                            
                2021-01-02 10:59
              
            
            
                                                                       
for row in enumerate(r):


will pull tuples. You are then trying to select your ith element from a 2 element tuple.

for example 

>> for i in enumerate({"a":1, "b":2}): print i
(0, 'a')
(1, 'b')


Additionally, since dictionaries are hash tables, your initial order is not necessarily preserved. for instance:

>>list({"a":1, "b":2, "c":3, "d":5})
['a', 'c', 'b', 'd']

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  悲&欢浪女        
                
              
                            
                2021-01-02 11:08
              
            
            
                                                                       
Assuming L is a list containing the line numbers you want, you could do :

with open("~/file.csv") as f:
    r = csv.DictReader(f)
    for i, line in enumerate(r):
        if i in L:    # or (i+2) in L: from your second example
            print line


That way :


you read the file only once
you do not load the whole file in memory
you only get the lines you are interested in


The only caveat is that you read whole file even if L = [3]
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  星月不相逢        
                
              
                            
                2021-01-02 11:11
              
            
            
                                                                       
Just to sum up the great ideas, I ended up using something like this: L can be sorted relatively quickly, and in my case it was actually already sorted. So, instead of several membership checks in L it pays off to sort it and then only check each index against the first entry of it. Here is my piece of code:

count=0
with open('~/file.csv') as f:
    r = csv.DictReader(f)
    for row in r:
        count += 1
        if L == []:
            break
        elif count == L[0]:
            print (row)
            L.pop(0)


Note that this stops as soon as we've gone through L once.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  梦如初夏        
                
              
                            
                2021-01-02 11:19
              
            
            
                                                                       
A file doesn't have "lines" or "rows". What you consider a "line" is "what is found between two newline characters". As such you cannot read the nth line without reading the lines before it, as you couldn't count the newline characters.

Answer 1: if you consider your example, but with L=[9], unrolling your loops would give:

i=9
row = (0, {'Col 2': 'row12', 'Col 3': 'row13', 'Col 1': 'row11'})


As you can see, row is a tuple with two members, calling row[i] means row[9], hence the IndexError.

Answer 2: This is very slow because you are reading the file up to the line number every time. In your example, you read the first 2 lines, then the first 5, then the first 15,  then the first 98, etc. So you've read the first 5 lines 3 times. You could create a generator that only returns the lines you want (beware, line numbers would be 0-indexed):

def read_my_lines(csv_reader, lines_list):
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_list:
            yield line_number, row


So when you want to process the lines, you would do:

L = [2, 5, 15, 98, ...]
with open('~/file.csv') as f:
    r = csv.DictReader(f)
    for line_number, line in read_my_lines(r, L):
        do_something_with_line(line)


* Edit *

This could further be improved to stop reading the file when you've read all the lines you wanted:

def read_my_lines(csv_reader, lines_list):
    # make sure every line number shows up only once:
    lines_set = set(lines_list)
    for line_number, row in enumerate(csv_reader):
        if line_number in lines_set:
            yield line_number, row
            lines_set.remove(line_number)
            # Stop when the set is empty
            if not lines_set:
                raise StopIteration

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复