I am processing a csv-file which is 2.5 GB big. The 2.5 GB table looks like this:
columns=[ka,kb_1,kb_2,timeofEvent,timeInterval]
Based on your snippet, reading line-by-line ( I assume that kb_2 is the error indicator ):
groups = {}
with open("data/petaJoined.csv", "r") as large_file:
    for line in large_file:
        arr = line.split('\t')
        # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
        k = arr[0] + ',' + arr[1]
        if k not in groups:
            groups[k] = {'record_count': 0, 'error_sum': 0.0}
        groups[k]['record_count'] += 1
        groups[k]['error_sum'] += float(arr[2])

for k, v in groups.items():
    print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))
This code snippet stores all the groups in a dictionary and calculates the error rate only after the entire file has been read.
It will run into an out-of-memory exception if there are too many distinct group combinations.
Q: Does anyone know what is happening?
A: Yes. The sum of all data plus the memory-overheads of the in-RAM objects !< RAM
It is a natural part of any formal abstraction to add some overhead, so that additional features can be implemented on a higher ( more abstract ) layer. That means the more abstract / feature-rich a representation of a dataset is chosen, the more memory- & processing-overheads are to be expected.
import numpy as np

ITEMasINT   = 32345
ITEMasTUPLE = ( 32345, )
ITEMasLIST  = [ 32345, ]
ITEMasARRAY = np.array( [ 32345, ] )
ITEMasDICT  = { 0: 32345, }

######## .__sizeof__() -> int  # "size of object in memory, in bytes"
ITEMasINT.__sizeof__()   ->  12  #  100% _ trivial INT
ITEMasTUPLE.__sizeof__() ->  16  #  133% _ en-tuple-d
ITEMasLIST.__sizeof__()  ->  24  #  200% _ list-ed
ITEMasARRAY.__sizeof__() ->  40  #  333% _ numpy-wrapped
ITEMasDICT.__sizeof__()  -> 124  # 1033% _ hash-associated asDict
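To reproduce such measurements on your own machine, a minimal sketch could look like the one below ( the exact byte counts are platform-, Python-version- and numpy-version-dependent, so expect different absolute numbers; sys.getsizeof() is the documented wrapper around .__sizeof__() plus garbage-collector overhead ):

import sys
import numpy as np

candidates = [ ( 'trivial INT    ', 32345 ),
               ( 'en-tuple-d     ', ( 32345, ) ),
               ( 'list-ed        ', [ 32345, ] ),
               ( 'numpy-wrapped  ', np.array( [ 32345, ] ) ),
               ( 'asDict         ', { 0: 32345, } ),
               ]

base = sys.getsizeof( candidates[0][1] )                 # plain int as the 100% reference

for label, obj in candidates:
    size = sys.getsizeof( obj )                          # per-object size, not counting referenced items
    print( '{0:s} {1:>5d} B ~ {2:>5.0f}%'.format( label, size, 100.0 * size / base ) )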
If personal experience is not enough, check the "costs" of re-wrapping the ( already not small ) input data into pandas and the overheads that come with it:
CParserError: Error tokenizing data. C error: out of memory
Segmentation fault (core dumped)
and
CParserError: Error tokenizing data. C error: out of memory
*** glibc detected *** python: free(): ...
...
..
.
Aborted (core dumped)
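For context, crashes of this kind are what one typically gets from a whole-file load, i.e. something along these lines ( the file name and column names are taken from the question; the exact failure mode depends on the available RAM and the pandas version ):

import pandas as pd

# a whole-file load: pandas has to materialise all rows, plus per-object overheads, in RAM at once
df = pd.read_csv( "data/petaJoined.csv",
                  sep   = "\t",
                  names = [ "ka", "kb_1", "kb_2", "timeofEvent", "timeInterval" ] )

print( df.groupby( [ "ka", "kb_1" ] )[ "kb_2" ].mean() )   # works only if the whole table fits in RAM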
Q: Maybe there is a solution?
A: Simply follow the computational strategy and deploy a memory-efficient & fast processing of the csv-input. It is still fileIO, with some 8-15 ms access times and a rather low-performance stream data-flow ( even SSD-devices peak at about 960 MB/s transfer rate ), so your blocking factor is the memory-allocation limit, not the input stream. Rather be patient with the input-stream and do not crash into a principal memory-barrier with some in-RAM super-object, which would have been introduced just to be finally asked ( if it did not crash already during its instantiation ... ) to compute a plain sum / nROWs.
Line-by-line or block-arranged reads allow you to calculate the results on-the-fly, and a register-based sliding-window computation strategy ( a dict and alike, used as interim storage of the results ) is both fast and memory-efficient ( Uri has provided an example of exactly that above; a cleaned-up sketch follows below ).
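As an illustration only, a minimal sketch of such a single-pass, register-based aggregation ( the file name, the tab delimiter and the five-column layout without a header row are assumptions carried over from the snippet above; collections.defaultdict merely removes the explicit membership test ):

from collections import defaultdict

# one small "register" per group: [ record_count, error_sum ]
registers = defaultdict( lambda: [ 0, 0.0 ] )

with open( "data/petaJoined.csv", "r" ) as large_file:
    for line in large_file:
        # assumed layout: ka \t kb_1 \t kb_2 \t timeofEvent \t timeInterval
        ka, kb_1, kb_2, _timeofEvent, _timeInterval = line.rstrip( "\n" ).split( "\t" )
        reg = registers[ ka + "," + kb_1 ]          # composite group key, as above
        reg[0] += 1                                 # record_count
        reg[1] += float( kb_2 )                     # error_sum ( kb_2 as the error indicator )

for group, ( record_count, error_sum ) in registers.items():
    print( "{0}: {1}".format( group, error_sum / record_count ) )

The same idea applies to block-arranged reads: pandas.read_csv() accepts a chunksize parameter, so one can aggregate each chunk into such per-group registers and merge them, without ever holding the full 2.5 GB table in RAM at once.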
This principal approach has been used both in real-time constrained systems and in system-on-chip designs for processing large data-streams for more than the last half century, so nothing new under the Sun.
In case even your results' size cannot fit in RAM, then it makes no sense to even start processing the input file, does it?
Processing BigData is neither about super-up-scaling the COTS-dataObjects nor about finding the best or the most sexy "one-liner" ...
BigData requires a lot of understanding of how to process data both fast and smart, so as to avoid the extreme costs of even small overheads. Principal mistakes are forgiving on just a few GBs of small-bigData, but will kill anyone's budget & efforts once the same is tried on a larger playground.