How to split/expand a string value into several pandas DataFrame rows?


Let's say my DataFrame df is created like this:

df = pd.DataFrame({"title" : ["Robin Hood", "Madagaskar"],
                  "genres" :


                      
3 answers

天涯浪人, 2020-11-27 23:01

Since pandas 0.25.0 there is a native method for this: explode.

This method unnests each element of a list into its own row and repeats the values of the other columns.

So we first call Series.str.split on the string column to turn each string into a list of elements, then explode that column.

>>> df.assign(genres=df['genres'].str.split(', ')).explode('genres')

        title     genres
0  Robin Hood     Action
0  Robin Hood  Adventure
1  Madagaskar     Family
1  Madagaskar  Animation
1  Madagaskar     Comedy
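
As the output above shows, explode keeps the original row labels, so the index repeats (0, 0, 1, 1, 1). A minimal sketch of getting a fresh 0..n-1 index afterwards follows. The sample data is an assumption reconstructed from the output shown in the answers, since the question's DataFrame definition is cut off.

```python
import pandas as pd

# Assumed sample data, reconstructed from the output shown in the answers.
df = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})

# Split each string into a list, then give every list element its own row.
exploded = df.assign(genres=df["genres"].str.split(", ")).explode("genres")

# explode keeps the original index; reset it to get consecutive row labels.
result = exploded.reset_index(drop=True)
print(result)
```

Whether to reset the index depends on the use case: the repeated labels are handy if you later need to join the rows back to the original frame.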

                                                                        
                                                        
            
            
              
                
忘掉有多难, 2020-11-27 23:11

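You can use np.repeat with np.concatenate for flattening.

import numpy as np
import pandas as pd

# Split on a comma followed by optional whitespace, then count genres per row
splitted = df['genres'].str.split(r',\s*')
l = splitted.str.len()

# Repeat each title once per genre and flatten the genre lists
df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
                     'genres': np.concatenate(splitted.values)}, columns=['title','genres'])
print(df1)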
        title      genres
0  Robin Hood      Action
1  Robin Hood   Adventure
2  Madagaskar      Family
3  Madagaskar   Animation
4  Madagaskar      Comedy
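
To make the two NumPy calls easier to follow in isolation, here is a small sketch without pandas. The names titles, genre_lists and counts are illustrative only, and the values are assumed from the output above.

```python
import numpy as np

# Illustrative inputs (assumed from the output shown above).
titles = np.array(["Robin Hood", "Madagaskar"])
genre_lists = [["Action", "Adventure"], ["Family", "Animation", "Comedy"]]

# One repeat count per title: the number of genres in the matching list.
counts = [len(genres) for genres in genre_lists]

# np.repeat duplicates each title counts[i] times, keeping the order.
repeated_titles = np.repeat(titles, counts)

# np.concatenate joins the per-row lists into one flat array.
flat_genres = np.concatenate(genre_lists)

for title, genre in zip(repeated_titles, flat_genres):
    print(title, genre)
```

Because both arrays are built in row order, position i in repeated_titles lines up with position i in flat_genres, which is what lets them be combined into one DataFrame.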


Timings:

df = pd.concat([df]*100000).reset_index(drop=True)

In [95]: %%timeit
    ...: splitted = df['genres'].str.split(r',\s*')
    ...: l = splitted.str.len()
    ...: 
    ...: df1 = pd.DataFrame({'title': np.repeat(df['title'].values, l),
    ...:                      'genres':np.concatenate(splitted.values)}, columns=['title','genres'])
    ...: 
    ...: 
1 loop, best of 3: 709 ms per loop

In [96]: %timeit (df.set_index('title')['genres'].str.split(r',\s*', expand=True).stack().reset_index(name='genre').drop(columns='level_1'))
1 loop, best of 3: 750 ms per loop
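
Outside IPython, a comparable measurement can be sketched with the standard timeit module. The helper names with_repeat and with_explode are hypothetical, the sample data is assumed from the outputs above, and the frame is only repeated 1,000 times to keep the run short. Absolute numbers will differ by machine and pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed sample data, reconstructed from the output shown in the answers.
small = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})
# Repeat the rows to get a larger frame (1,000 copies to keep the run short).
df = pd.concat([small] * 1000).reset_index(drop=True)


def with_repeat(frame):
    """np.repeat / np.concatenate approach (hypothetical helper name)."""
    parts = frame["genres"].str.split(r",\s*")
    counts = parts.str.len().to_numpy()
    return pd.DataFrame({
        "title": np.repeat(frame["title"].to_numpy(), counts),
        "genres": np.concatenate(parts.to_numpy()),
    })


def with_explode(frame):
    """split + explode approach (hypothetical helper name)."""
    return (frame.assign(genres=frame["genres"].str.split(r",\s*"))
                 .explode("genres")
                 .reset_index(drop=True))


for func in (with_repeat, with_explode):
    # Average over three calls; results vary from run to run.
    seconds = timeit.timeit(lambda: func(df), number=3) / 3
    print(f"{func.__name__}: {seconds:.4f} s per call")
```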

                                                                        
                                                        
            
            
              
                
暗喜, 2020-11-27 23:15

In [33]: (df.set_index('title')
            ['genres'].str.split(r',\s*', expand=True)
            .stack()
            .reset_index(name='genre')
            .drop(columns='level_1'))
Out[33]:
        title      genre
0  Robin Hood     Action
1  Robin Hood  Adventure
2  Madagaskar     Family
3  Madagaskar  Animation
4  Madagaskar     Comedy
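
Since the chain above comes without explanation, the following sketch breaks it into its intermediate steps. The sample data is assumed from the output shown, and the explicit dropna() is an addition that is not in the original answer: it keeps the result the same regardless of whether stack itself drops the padding values, which depends on the pandas version.

```python
import pandas as pd

# Assumed sample data, reconstructed from the output shown in the answers.
df = pd.DataFrame({
    "title": ["Robin Hood", "Madagaskar"],
    "genres": ["Action, Adventure", "Family, Animation, Comedy"],
})

# Step 1: one column per genre; shorter rows are padded with missing values.
wide = df.set_index("title")["genres"].str.split(r",\s*", expand=True)

# Step 2: move the columns into the rows, giving a Series indexed by
# (title, column position); drop the padding explicitly.
long = wide.stack().dropna()

# Step 3: turn the index back into columns and discard the column position,
# which reset_index names 'level_1' because that index level is unnamed.
result = long.reset_index(name="genre").drop(columns="level_1")
print(result)
```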


PS: here you can find a more generic approach.
                                                                        
                                                        
            
            
              
                