How to stream in and manipulate a large data file in python

庸人自扰 2021-02-04 15:20

I have a relatively large (1 GB) text file that I want to cut down in size by summing across categories:

Geography AgeGroup Gender Race Count
County1   1                 


        
2 Answers
  • 2021-02-04 15:44

    You can use dask.dataframe, which is syntactically similar to pandas, but performs manipulations out-of-core, so memory shouldn't be an issue:

    import dask.dataframe as dd

    # dask reads the CSV lazily in blocks and computes the aggregation
    # out-of-core, so the 1 GB file never has to fit in memory at once.
    df = dd.read_csv('my_file.csv')
    df = df.groupby('Geography')['Count'].sum().to_frame()
    df.to_csv('my_output.csv')
    

    Alternatively, if pandas is a requirement, you can use chunked reads, as mentioned by @chrisaycock. You may want to experiment with the chunksize parameter.

    import pandas as pd

    # Operate on chunks.
    data = []
    for chunk in pd.read_csv('my_file.csv', chunksize=10**5):
        chunk = chunk.groupby('Geography', as_index=False)['Count'].sum()
        data.append(chunk)

    # Combine the chunked data, then aggregate again because a Geography
    # value may appear in more than one chunk.
    df = pd.concat(data, ignore_index=True)
    df = df.groupby('Geography')['Count'].sum().to_frame()
    df.to_csv('my_output.csv')
    
  • 2021-02-04 15:57

    I do like @root's solution, but I would go a bit further in optimizing memory usage: keep only the aggregated DataFrame in memory and read only the columns you really need:

    import pandas as pd

    filename = 'my_file.csv'           # path to the large input file
    cols = ['Geography', 'Count']      # read only the columns we actually need
    df = pd.DataFrame()

    chunksize = 2   # tiny value for the demo data below; use something like 10**5 for real files
    for chunk in pd.read_csv(filename, usecols=cols, chunksize=chunksize):
        # merge the previously aggregated DF with the new portion of data and aggregate it again
        df = (pd.concat([df,
                         chunk.groupby('Geography')['Count'].sum().to_frame()])
                .groupby(level=0)['Count']
                .sum()
                .to_frame()
             )

    df.reset_index().to_csv('c:/temp/result.csv', index=False)
    

    test data:

    Geography,AgeGroup,Gender,Race,Count
    County1,1,M,1,12
    County2,2,M,1,3
    County3,2,M,2,0
    County1,1,M,1,12
    County2,2,M,1,33
    County3,2,M,2,11
    County1,1,M,1,12
    County2,2,M,1,111
    County3,2,M,2,1111
    County5,1,M,1,12
    County6,2,M,1,33
    County7,2,M,2,11
    County5,1,M,1,12
    County8,2,M,1,111
    County9,2,M,2,1111
    

    output.csv:

    Geography,Count
    County1,36
    County2,147
    County3,1122
    County5,24
    County6,33
    County7,11
    County8,111
    County9,1111
    

    PS: using this approach you can process huge files, because only the aggregated DataFrame is kept in memory.

    PPS: the chunking approach should work unless you need your data sorted - in that case I would first sort the file with classic UNIX tools such as awk or sort, as sketched below.
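
    For instance, here is a minimal sketch, assuming a Unix-like system and the illustrative file name my_file_sorted.csv, that keeps the header line in place and sorts the remaining rows by the first column (Geography) before the chunked aggregation runs:

    import subprocess

    # Write the header line first, then append the data rows sorted by the
    # first comma-separated field (Geography) using the external UNIX tools.
    with open('my_file_sorted.csv', 'w') as out:
        subprocess.run(['head', '-n', '1', 'my_file.csv'], stdout=out, check=True)
        subprocess.run('tail -n +2 my_file.csv | sort -t, -k1,1',
                       shell=True, stdout=out, check=True)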

    I would also recommend using PyTables (HDF5 storage) instead of CSV files - it is very fast, lets you read data conditionally (using the where parameter), and is usually much faster and more resource-friendly than working with CSV.
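
    Here is a minimal sketch, assuming pandas with PyTables installed and illustrative store/key names, that converts the CSV to an HDF5 table chunk by chunk and then queries it conditionally:

    import pandas as pd

    # Convert the CSV to an HDF5 table once, chunk by chunk; Geography is
    # declared as a data column so it can be filtered with `where`, and
    # min_itemsize reserves room for the string values (illustrative size).
    for chunk in pd.read_csv('my_file.csv', usecols=['Geography', 'Count'],
                             chunksize=10**5):
        chunk.to_hdf('my_store.h5', key='data', format='table', append=True,
                     data_columns=['Geography'], min_itemsize={'Geography': 32})

    # Later reads can filter on disk instead of scanning the whole file.
    subset = pd.read_hdf('my_store.h5', key='data', where="Geography == 'County1'")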
