pandas groupby with sum() on large csv file?


I have a big file (19GB or so) that I want to load in memory to perform an aggregation over some columns.

The file looks like this:

id, col1, col2,


                      
2 Answers

孤独总比滥情好
2020-12-10 06:33

First, build a list of unique constants by reading the CSV with only the columns you need: usecols=['id', 'col1']. Then read the CSV in chunks, concatenate the rows of each chunk that match one id, and run the groupby on that subset.

If col1 is the better column to split on, use constants = df['col1'].unique().tolist() instead. It depends on your data.

Or you can read only one column with df = pd.read_csv(io.StringIO(temp), sep=",", usecols=['id']). Again, it depends on your data.

import pandas as pd
import numpy as np
import io

#test data
temp=u"""id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213"""
df = pd.read_csv(io.StringIO(temp), sep=",", usecols=['id', 'col1'])
# drop duplicates; from the output you can pick the constants
df = df.drop_duplicates()
print(df)
#   id  col1
#0   1    13
#2   1    12
#3   2    18
#9   3    14

# for example, a hand-written list of constants
constants = [1,2,3]
# or the unique values of column id as a list
constants = df['id'].unique().tolist()
print(constants)
#[1, 2, 3]

for i in constants:
    # re-read the CSV in chunks for each constant
    iter_csv = pd.read_csv(io.StringIO(temp), delimiter=",", chunksize=10)
    # concat the rows of every chunk where id == constant
    df = pd.concat([chunk[chunk['id'] == i] for chunk in iter_csv])
    # your groupby function
    data = df.reset_index(drop=True).groupby(["id","col1"], as_index=False).sum()
    print(data.to_csv(index=False))

    #id,col1,col2,col3
    #1,12,15,13
    #1,13,30,28
    #
    #id,col1,col2,col3
    #2,18,90,78
    #
    #id,col1,col2,col3
    #3,14,215,239
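
For a file of around 19GB, even the single-column read used to build the list of constants may be worth doing in chunks. The following is a minimal sketch under that assumption, using the same test data; `csv_text`, `unique_ids` and the chunk size of 5 are illustrative choices, not part of the original answer.

```python
import io
import pandas as pd

# Same test data as in the answer (a small stand-in for the big file)
csv_text = """id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213"""

# Collect the distinct ids one chunk at a time, reading only the id column
unique_ids = set()
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["id"], chunksize=5):
    unique_ids.update(chunk["id"].unique().tolist())

# Sorted list that can be used as `constants` in the loop
constants = sorted(unique_ids)
print(constants)
```

Only the set of distinct ids is kept between chunks, so memory use depends on the number of distinct keys rather than on the number of rows.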

                                                                        
                                                        
            
            
              
                
感情败类
2020-12-10 06:39

Dask solution

dask.dataframe can do this almost without modification.

$ cat so.csv
id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213

$ pip install dask[dataframe]
$ ipython

In [1]: import dask.dataframe as dd

In [2]: df = dd.read_csv('so.csv', sep=',')

In [3]: df.head()
Out[3]: 
   id  col1  col2  col3
0   1    13    15    14
1   1    13    15    14
2   1    12    15    13
3   2    18    15    13
4   2    18    15    13

In [4]: df.groupby(['id', 'col1']).sum().compute()
Out[4]: 
         col2  col3
id col1            
1  12      15    13
   13      30    28
2  18      90    78
3  14     215   239


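No one has implemented as_index=False for the Dask groupby yet, though. We can work around this with assign: copy the key columns under new names and group by the copies. The group keys then appear in the index as id_2 and col1_2, while the original id and col1 columns are summed along with the rest.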

In [5]: df.assign(id_2=df.id, col1_2=df.col1).groupby(['id_2', 'col1_2']).sum().compute()
Out[5]: 
             id  col1  col2  col3
id_2 col1_2                      
1    12       1    12    15    13
     13       2    26    30    28
2    18      12   108    90    78
3    14       9    42   215   239
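
Another way to get the keys back as ordinary columns, offered as a sketch rather than as this answer's method: the object returned by .compute() in Out[4] is a plain pandas DataFrame, so pandas' reset_index() can be applied to it. The example below uses pandas only, with a pandas groupby standing in for the computed Dask result; `csv_text`, `computed`, `flat` and `expected` are illustrative names.

```python
import io
import pandas as pd

csv_text = """id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213"""

df = pd.read_csv(io.StringIO(csv_text))

# Stand-in for the pandas DataFrame that .compute() returns in Out[4]:
# the group keys (id, col1) live in a MultiIndex
computed = df.groupby(["id", "col1"]).sum()

# Move the keys out of the index and back into regular columns
flat = computed.reset_index()

# What pandas itself produces with as_index=False, for comparison
expected = df.groupby(["id", "col1"], as_index=False).sum()

print(flat)
```

Unlike the assign workaround, this leaves id and col1 holding the key values instead of their sums.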


How this works

We'll pull out chunks and do groupbys just like in your first example.  Once we're done grouping and summing each of the chunks we'll gather all of the intermediate results together and do another slightly different groupby.sum.  This makes the assumption that the intermediate results will fit in memory. 
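
The two-stage idea described above can be sketched in plain pandas. This is an illustration of the general pattern, not Dask's actual implementation; `csv_text`, `partials` and the chunk size of 5 are illustrative choices.

```python
import io
import pandas as pd

csv_text = """id,col1,col2,col3
1,13,15,14
1,13,15,14
1,12,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
2,18,15,13
3,14,15,13
3,14,15,13
3,14,185,213"""

# Stage 1: group and sum each chunk on its own
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=5):
    partials.append(chunk.groupby(["id", "col1"]).sum())

# Stage 2: stack the per-chunk results and sum again over the same keys,
# which now sit in the index, so we group by index level
result = (
    pd.concat(partials)
    .groupby(level=["id", "col1"])
    .sum()
    .reset_index()
)

print(result.to_csv(index=False))
```

This works because a sum of per-chunk sums equals the overall sum. Only the intermediate results are held in memory at once, which is the assumption mentioned above.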

Parallelism

As a pleasant side effect, this will also operate in parallel.
                                                                        
                                                        
            
            
              
                