How to efficiently compare rows in a pandas DataFrame?

自闭症患者 2021-02-10 11:28

I have a pandas dataframe containing a record of lightning strikes with timestamps and global positions in the following format:

Index      Date      Time
4 Answers
  •  自闭症患者
    2021-02-10 12:14

    This answer might not be very efficient. I'm facing a very similar problem and am currently looking for something faster myself, because this approach still takes about an hour on my dataframe (600k rows).

    I first suggest you don't even think about using nested for loops like you do. You might not be able to avoid one loop (which is what I do using apply), but the second one can (and should) be vectorized.

    The idea of this technique is to create a new column in the dataframe storing whether there is another strike nearby (in both time and space).

    First, let's create a function computing (with NumPy) the distances between one strike (the reference) and all the others, using the haversine formula:

    import numpy as np

    def get_distance(reference, other_strikes):
        radius = 6371.00085  # mean Earth radius in km
        # Convert latitudes and longitudes to radians, then compute deltas:
        lat1 = np.radians(other_strikes.Lat)
        lat2 = np.radians(reference[0])
        dLat = lat2 - lat1
        dLon = np.radians(reference[1]) - np.radians(other_strikes.Lon)
        # Haversine formula: great-circle distance in km
        a = np.sin(dLat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dLon / 2.0) ** 2
        return 2 * np.arcsin(np.minimum(1, np.sqrt(a))) * radius
    
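    As a quick standalone sanity check of this haversine implementation (the Paris and London coordinates below are illustrative, not from the question), the Paris–London great-circle distance should come out around 344 km:

```python
import numpy as np
import pandas as pd

def get_distance(reference, other_strikes):
    radius = 6371.00085  # mean Earth radius in km
    lat1 = np.radians(other_strikes.Lat)
    lat2 = np.radians(reference[0])
    dLat = lat2 - lat1
    dLon = np.radians(reference[1]) - np.radians(other_strikes.Lon)
    a = np.sin(dLat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dLon / 2.0) ** 2
    return 2 * np.arcsin(np.minimum(1, np.sqrt(a))) * radius

# Illustrative coordinates: Paris (48.8566, 2.3522), London (51.5074, -0.1278)
others = pd.DataFrame({'Lat': [51.5074], 'Lon': [-0.1278]})
d = get_distance([48.8566, 2.3522], others)
print(round(float(d.iloc[0])))  # roughly 344 km
```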

    Then create a function that checks whether, for a given strike, there is at least one other strike nearby:

    def is_there_a_strike_nearby(date_ref, lat_ref, lon_ref, delta_t, delta_d, other_strikes):
        # Temporal window around the reference strike (delta_t in days)
        dmin = date_ref - np.timedelta64(delta_t, 'D')
        dmax = date_ref + np.timedelta64(delta_t, 'D')
    
        # First find all strikes within the temporal range
        # (assumes other_strikes is sorted by Date)
        ind = other_strikes.Date.searchsorted([dmin, dmax])
        nearby_strikes = other_strikes.iloc[ind[0]:ind[1]].copy()
    
        # The reference strike itself always falls in this window,
        # so "nearby" means more than one strike remains
        if len(nearby_strikes) <= 1:
            return False
    
        # Now compute the spatial distance:
        nearby_strikes['distance'] = get_distance([lat_ref, lon_ref], nearby_strikes[['Lat', 'Lon']])
    
        nearby_strikes = nearby_strikes[nearby_strikes['distance'] <= delta_d]
    
        return len(nearby_strikes) > 1
    

    Now that all your functions are ready, you can use apply on your dataframe:

    data['presence of nearby strike'] = data.apply(
        lambda x: is_there_a_strike_nearby(x['Date'], x['Lat'], x['Lon'], delta_t, delta_d, data),
        axis=1)
    

    And that's it: you have now created a new column in your dataframe that indicates whether each strike is isolated (False) or not (True). Creating your new dataframe from this is easy.
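    For example, selecting only the isolated strikes is a single boolean-indexing step (the frame below is a hypothetical result of the apply above, just to show the pattern):

```python
import pandas as pd

# Hypothetical output of the apply step above
data = pd.DataFrame({
    'Date': pd.to_datetime(['2021-02-01', '2021-02-02', '2021-02-05']),
    'presence of nearby strike': [True, True, False],
})

# Keep only the isolated strikes (no neighbour in time and space)
isolated = data[~data['presence of nearby strike']].reset_index(drop=True)
print(len(isolated))  # → 1
```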

    The problem with this method is that it is still slow to run. There are ways to make it faster: for instance, change is_there_a_strike_nearby to also take as arguments your data sorted by Lat and Lon, and use more searchsorted calls to filter over Lat and Lon before computing the distance (for instance, if you want the strikes within a range of 10 km, you can filter with a delta_Lat of 0.09°).
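    The latitude pre-filter idea could be sketched like this (variable names are illustrative, not from the answer; it relies on one degree of latitude being roughly 111 km):

```python
import pandas as pd

# ~1 degree of latitude ≈ 111 km, so a 10 km radius maps to ~0.09 degrees
delta_d = 10.0                # search radius in km
delta_lat = delta_d / 111.0   # equivalent latitude band in degrees

# Toy data, pre-sorted by Lat so searchsorted is valid
strikes_by_lat = pd.DataFrame({
    'Lat': [48.0, 48.05, 48.5, 52.0],
    'Lon': [2.0, 2.01, 2.2, 0.1],
}).sort_values('Lat').reset_index(drop=True)

# Narrow down to candidates within the latitude band before
# computing any haversine distance
lat_ref = 48.04
lo, hi = strikes_by_lat['Lat'].searchsorted([lat_ref - delta_lat, lat_ref + delta_lat])
candidates = strikes_by_lat.iloc[lo:hi]
print(len(candidates))  # candidates within ~10 km of lat_ref in latitude only
```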

    Any feedback over this method is more than welcome!
