I have some data that I'm parsing from XML to a pandas DataFrame. The XML data roughly looks like this:
Assuming you have enough memory, your task will be more easily accomplished if your DataFrame holds one variant per row:
track_name variants time route_id stop_id serial
"trackname1" 1 "21:23" 5 103 1
"trackname1" 2 "21:23" 5 103 1
"trackname1" 3 "21:23" 5 103 1
"trackname1" 1 "21:26" 5 17 2
"trackname1" 2 "21:26" 5 17 2
"trackname1" 3 "21:26" 5 17 2
...
"trackname1" 4 "21:20" 5 103 1
"trackname1" 5 "21:20" 5 103 1
...
"trackname2" 1 "20:59" 3 45 1
Then you could find "all rows for variant 3 on route_id 5" with
df.loc[(df['variants']==3) & (df['route_id']==5)]
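To make that concrete, here is a minimal sketch of the long format using a few of the rows from the table above (the values are illustrative, taken from your sample data):

```python
import pandas as pd

# One variant per row; column names as in the table above.
df = pd.DataFrame({
    "track_name": ["trackname1"] * 6,
    "variants":   [1, 2, 3, 1, 2, 3],
    "time":       ["21:23"] * 3 + ["21:26"] * 3,
    "route_id":   [5] * 6,
    "stop_id":    [103, 103, 103, 17, 17, 17],
    "serial":     [1, 1, 1, 2, 2, 2],
})

# All rows for variant 3 on route_id 5 -- a plain vectorized comparison.
result = df.loc[(df["variants"] == 3) & (df["route_id"] == 5)]
```

Here `result` picks out the two variant-3 rows (stops 103 and 17), with no string handling involved.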
If you pack many variants into one row, such as
"trackname1" "1,2,3" "21:23" "5" "103" "1"
then you could find such rows using
df.loc[(df['variants'].str.contains("3")) & (df['route_id']=="5")]
assuming that the variants are always single digits. If there are also 2-digit variants like "13" or "30", then you would need to pass a more complicated regex pattern to str.contains.
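For example, one way to write such a pattern is to require that the "3" be a whole comma-separated token, anchored by the start/end of the string or a comma (a sketch with made-up sample values):

```python
import pandas as pd

s = pd.Series(["1,2,3", "13,30", "3,14", "2,4"])

# Naive substring test: also matches "13" and "30".
naive = s.str.contains("3")

# Anchored pattern: "3" must be an entire comma-separated token.
exact = s.str.contains(r"(?:^|,)3(?:,|$)")
```

`naive` is True for the first three strings, while `exact` is True only for `"1,2,3"` and `"3,14"`.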
Alternatively, you could use apply to split each variant on commas:
df['variants'].apply(lambda x: "3" in x.split(','))
but this is very inefficient: you would now be calling a Python function once for every row, doing string splitting and a list-membership test instead of a single vectorized integer comparison.
Thus, to avoid a possibly complicated regex or a relatively slow call to apply, I think your best bet is to build the DataFrame with one integer variant per row.
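If you have already parsed the XML into packed rows, you don't need to redo the parsing: splitting the string and calling DataFrame.explode converts it to the long format. A sketch, with hypothetical sample values:

```python
import pandas as pd

# Packed form, as it might come out of the XML parse.
packed = pd.DataFrame({
    "track_name": ["trackname1", "trackname2"],
    "variants":   ["1,2,3", "4,5"],
    "time":       ["21:23", "20:59"],
    "route_id":   [5, 3],
})

# Split the comma string into a list, explode to one row per variant,
# and cast to int so later comparisons stay vectorized.
long = packed.assign(variants=packed["variants"].str.split(",")).explode("variants")
long["variants"] = long["variants"].astype(int)
```

The other columns are repeated for each exploded row, giving exactly the one-variant-per-row layout shown at the top.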