Operations on a very large CSV with pandas

面向向阳花 2021-01-29 09:42

I have been using pandas on csv files to get some values out of them. My data looks like this:

\"A\",23.495,41.995,\"this is a sentence with some words\"
\"B\",         


        
2 Answers
  • 2021-01-29 09:52

    Okay, I misunderstood the chunksize parameter. I solved it like this:

    import pandas as pd
    from collections import Counter

    frame = pd.DataFrame()
    chunks = pd.read_csv("csvfile.txt", sep=",", header=None,
                         names=["group", "val1", "val2", "text"],
                         chunksize=1000000)
    for df in chunks:
        # Per-chunk tallies: overall group frequency, plus per-group row
        # counts for each search word (na=False skips rows with missing text)
        freq = Counter(df["group"])
        word1 = df[df["text"].str.contains("WORD1", na=False)].groupby("group").size()
        word2 = df[df["text"].str.contains("WORD2", na=False)].groupby("group").size()
        df1 = pd.concat([pd.Series(freq), word1, word2], axis=1)
        # Accumulate results across chunks; fill_value=0 handles groups
        # that do not appear in every chunk
        frame = frame.add(df1, fill_value=0)

    frame.to_csv("csv_out.txt", sep=",", encoding="utf-8")
    
  • 2021-01-29 10:16

    You can specify a chunksize option in the read_csv call. See the pandas read_csv documentation for details.
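    A minimal sketch of that option (the filename, column names, and chunk size here are placeholders):

    import pandas as pd

    # With chunksize set, read_csv returns an iterator of DataFrames
    # instead of loading the entire file into memory at once.
    reader = pd.read_csv("csvfile.txt", sep=",", header=None,
                         names=["group", "val1", "val2", "text"],
                         chunksize=100000)
    for chunk in reader:
        print(len(chunk))  # each chunk is an ordinary DataFrame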

    Alternatively, you could use Python's built-in csv library, create a csv.reader or csv.DictReader, and read the data in whatever chunk size you choose, as sketched below.
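
    A minimal sketch of that approach, assuming the same file as in the question and an arbitrary chunk size of 100,000 rows:

    import csv
    from collections import Counter
    from itertools import islice

    counts = Counter()
    with open("csvfile.txt", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        while True:
            # Pull up to 100,000 rows at a time; islice never
            # reads the whole file into memory
            chunk = list(islice(reader, 100000))
            if not chunk:
                break
            counts.update(row[0] for row in chunk)  # tally the "group" column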
