I have data in the following format in a .txt file:
UserId WordID
1 20
1 30
1 40
2 25
2 16
3 56
3 44
3 12
While, depending on your purpose, you might prefer to do this in pandas,
the numpy way would be:
import numpy as np

userid, wordid = np.loadtxt('/data/file.txt', skiprows=1, unpack=True)

# example use: collect each user's WordIDs into a list of arrays
mylist = []
for uid in np.unique(userid):
    mylist.append(wordid[userid == uid])
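As a self-contained sketch of the same idea (in-memory arrays standing in for the file, since `/data/file.txt` is just a placeholder path):

```python
import numpy as np

# the same data as in the question, in place of np.loadtxt's output
userid = np.array([1, 1, 1, 2, 2, 3, 3, 3])
wordid = np.array([20, 30, 40, 25, 16, 56, 44, 12])

# np.unique returns the sorted distinct UserIds; boolean masking picks
# out each user's WordIDs
mylist = []
for uid in np.unique(userid):
    mylist.append(wordid[userid == uid])

print([a.tolist() for a in mylist])  # [[20, 30, 40], [25, 16], [56, 44, 12]]
```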
I think you can use groupby with apply(tolist) and take values:
print(df.groupby('UserId')['WordID'].apply(lambda x: x.tolist()).values)
[[20, 30, 40] [25, 16] [56, 44, 12]]
Or apply list (thank you, B.M.):
print(df.groupby('UserId')['WordID'].apply(list).values)
[[20, 30, 40] [25, 16] [56, 44, 12]]
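On newer pandas versions you can also pass list to agg; a minimal runnable sketch, building the frame inline rather than reading the file:

```python
import pandas as pd

df = pd.DataFrame({'UserId': [1, 1, 1, 2, 2, 3, 3, 3],
                   'WordID': [20, 30, 40, 25, 16, 56, 44, 12]})

# groupby sorts the group keys by default, so the lists come out in
# UserId order
grouped = df.groupby('UserId')['WordID'].agg(list)
print(grouped.tolist())   # [[20, 30, 40], [25, 16], [56, 44, 12]]
print(grouped.to_dict())  # {1: [20, 30, 40], 2: [25, 16], 3: [56, 44, 12]}
```

The Series form (`grouped`) keeps the UserId as the index, which is handy if you need to look up a particular user's words later.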
Timings:
df = pd.concat([df]*1000).reset_index(drop=True)
In [358]: %timeit df.groupby('UserId')['WordID'].apply(list).values
1000 loops, best of 3: 1.22 ms per loop
In [359]: %timeit df.groupby('UserId')['WordID'].apply(lambda x: x.tolist()).values
1000 loops, best of 3: 1.23 ms per loop
If you are concerned with performance, numpy is, as is often the case, faster:
import numpy as np
import pandas as pd

df = pd.read_csv('file.txt', sep=r'\s+')

def numpyway():
    u, v = df.values.T
    ind = np.argsort(u, kind='mergesort')  # stable sort to preserve order
    # split v wherever the sorted UserId changes
    return np.split(v[ind], np.where(np.diff(u[ind]))[0] + 1)
In [12]: %timeit numpyway() # on 8000 lines
10000 loops, best of 3: 250 µs per loop
If 'UserId' is already sorted, it is about three times faster still.
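For the sorted case, the argsort can be skipped entirely; a sketch with the question's data inlined:

```python
import numpy as np

# UserId is already sorted, so no argsort is needed
u = np.array([1, 1, 1, 2, 2, 3, 3, 3])
v = np.array([20, 30, 40, 25, 16, 56, 44, 12])

# np.diff is nonzero exactly where the UserId changes; split there
groups = np.split(v, np.where(np.diff(u))[0] + 1)
print([g.tolist() for g in groups])  # [[20, 30, 40], [25, 16], [56, 44, 12]]
```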