Efficient computation of minimum of Haversine distances

旧时难觅i 2021-01-13 22:41

I have a dataframe with more than 2.7 million coordinates, and a separate list of ~2,000 coordinates. I'm trying to return the minimum

1 answer
  •  傲寒 (original poster)
    2021-01-13 23:13

    In essence, the haversine function is:

    # convert all latitudes/longitudes from decimal degrees to radians
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    
    # calculate haversine
    lat = lat2 - lat1
    lng = lng2 - lng1
    
    d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * asin(sqrt(d))
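
For reference, the fragment above can be wrapped into a self-contained scalar function. This is a minimal sketch: the function name `haversine` and the value of `AVG_EARTH_RADIUS` (a mean Earth radius in kilometres) are assumptions, since the answer does not define them.

```python
from math import radians, sin, cos, asin, sqrt

# Assumption: mean Earth radius in kilometres; the answer does not give a value.
AVG_EARTH_RADIUS = 6371.0088


def haversine(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points given in decimal degrees."""
    # convert all latitudes/longitudes from decimal degrees to radians
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))

    # differences in latitude and longitude
    lat = lat2 - lat1
    lng = lng2 - lng1

    # haversine formula
    d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
    return 2 * AVG_EARTH_RADIUS * asin(sqrt(d))
```

The result is in the same unit as `AVG_EARTH_RADIUS`, so kilometres under the assumption above.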
    

    Here's a vectorized method that uses NumPy broadcasting and NumPy ufuncs in place of the math-module functions, so that entire arrays are processed in one go -

    # Get array data; convert to radians to simulate 'map(radians,...)' part    
    coords_arr = np.deg2rad(coords_list)
    a = np.deg2rad(df.values)
    
    # Get the pairwise differences (one row per dataframe point, one column per list coordinate)
    lat = coords_arr[:,0] - a[:,0,None]
    lng = coords_arr[:,1] - a[:,1,None]
    
    # Compute the "cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2" part.
    # Add into "sin(lat * 0.5) ** 2" part.
    add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
    d = np.sin(lat * 0.5) ** 2 +  add0
    
    # Get h and assign into dataframe
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    df['Min_Distance'] = h.min(1)
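
One caveat with full broadcasting: at the sizes mentioned in the question (about 2.7 million rows against about 2,000 coordinates), each intermediate array would hold roughly 5.4 billion float64 values, about 43 GB. A possible workaround, sketched below, is to apply the same formula to blocks of rows. The function name `min_haversine_chunked`, the `chunk_size` default, and the value of `AVG_EARTH_RADIUS` are assumptions for illustration and are not part of the original answer.

```python
import numpy as np

# Assumption: mean Earth radius in kilometres; the answer does not give a value.
AVG_EARTH_RADIUS = 6371.0088


def min_haversine_chunked(points, coords_list, chunk_size=10_000):
    """Minimum haversine distance from each row of `points` to any row of
    `coords_list`. Both hold (latitude, longitude) pairs in decimal degrees."""
    coords_arr = np.deg2rad(np.asarray(coords_list, dtype=float))
    a = np.deg2rad(np.asarray(points, dtype=float))
    out = np.empty(len(a))

    for start in range(0, len(a), chunk_size):
        blk = a[start:start + chunk_size]

        # Same broadcasting as the vectorized method, limited to one block of rows
        lat = coords_arr[:, 0] - blk[:, 0, None]
        lng = coords_arr[:, 1] - blk[:, 1, None]
        add0 = np.cos(blk[:, 0, None]) * np.cos(coords_arr[:, 0]) * np.sin(lng * 0.5) ** 2
        d = np.sin(lat * 0.5) ** 2 + add0

        # arcsin(sqrt(d)) is increasing in d, so take the row minimum of d first
        # and apply the remaining functions to one value per row only.
        d_min = d.min(axis=1)
        out[start:start + chunk_size] = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d_min))

    return out
```

With the dataframe shown in this answer, usage would look like `df['Min_Distance'] = min_haversine_chunked(df[['Latitude', 'Longitude']].values, coords_list)`. With the default `chunk_size` and about 2,000 list coordinates, each temporary array is about 160 MB.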
    

    For a further performance boost, the numexpr module can be used to evaluate the transcendental functions (sin, cos, arcsin, sqrt).
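
The answer does not show the numexpr variant, so the following is only a sketch of what it might look like. The function name `min_haversine_numexpr` and the value of `AVG_EARTH_RADIUS` are assumptions, and the code falls back to plain NumPy when numexpr is not installed. Any speedup depends on array sizes and the machine, and no timing is claimed here.

```python
import numpy as np

try:
    import numexpr as ne
except ImportError:  # numexpr is optional in this sketch
    ne = None

# Assumption: mean Earth radius in kilometres; the answer does not give a value.
AVG_EARTH_RADIUS = 6371.0088


def min_haversine_numexpr(points, coords_list):
    """Minimum haversine distance from each row of `points` to any row of
    `coords_list`. Both hold (latitude, longitude) pairs in decimal degrees."""
    coords_arr = np.deg2rad(np.asarray(coords_list, dtype=float))
    a = np.deg2rad(np.asarray(points, dtype=float))

    # Column vectors for the points, row vectors for the list coordinates,
    # so the expression broadcasts to shape (len(points), len(coords_list)).
    lat1, lng1 = a[:, 0, None], a[:, 1, None]
    lat2, lng2 = coords_arr[:, 0], coords_arr[:, 1]

    if ne is not None:
        # The whole formula is evaluated in one pass by numexpr
        h = ne.evaluate(
            "2 * R * arcsin(sqrt(sin((lat2 - lat1) * 0.5) ** 2"
            " + cos(lat1) * cos(lat2) * sin((lng2 - lng1) * 0.5) ** 2))",
            local_dict={"R": AVG_EARTH_RADIUS, "lat1": lat1, "lng1": lng1,
                        "lat2": lat2, "lng2": lng2},
        )
    else:
        # Plain NumPy fallback with the same formula
        d = (np.sin((lat2 - lat1) * 0.5) ** 2
             + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) * 0.5) ** 2)
        h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))

    return h.min(axis=1)
```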


    Runtime test and verification

    Approaches -

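    def loopy_app(df, coords_list):
        # min_distance is not defined in this answer; presumably it is the
        # row-wise function from the question.
        # Note: the outer loop repeats the same df.apply call once per row.
        for row in df.itertuples():
            df['Min_Distance1'] = df.apply(min_distance, axis=1)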
    
    def vectorized_app(df, coords_list):   
        coords_arr = np.deg2rad(coords_list)
        a = np.deg2rad(df.values)
    
        lat = coords_arr[:,0] - a[:,0,None]
        lng = coords_arr[:,1] - a[:,1,None]
    
        add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
        d = np.sin(lat * 0.5) ** 2 +  add0
    
        h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
        df['Min_Distance2'] = h.min(1)
    

    Verification -

    In [158]: df
    Out[158]: 
       Latitude  Longitude
    0    39.989    -89.980
    1    39.923    -89.901
    2    39.990    -89.987
    3    39.884    -89.943
    4    39.030    -89.931
    
    In [159]: loopy_app(df, coords_list)
    
    In [160]: vectorized_app(df, coords_list)
    
    In [161]: df
    Out[161]: 
       Latitude  Longitude  Min_Distance1  Min_Distance2
    0    39.989    -89.980     126.637607     126.637607
    1    39.923    -89.901     121.266241     121.266241
    2    39.990    -89.987     126.037388     126.037388
    3    39.884    -89.943     118.901195     118.901195
    4    39.030    -89.931      53.765506      53.765506
    

    Timings -

    In [163]: df
    Out[163]: 
       Latitude  Longitude
    0    39.989    -89.980
    1    39.923    -89.901
    2    39.990    -89.987
    3    39.884    -89.943
    4    39.030    -89.931
    
    In [164]: %timeit loopy_app(df, coords_list)
    100 loops, best of 3: 2.41 ms per loop
    
    In [165]: %timeit vectorized_app(df, coords_list)
    10000 loops, best of 3: 96.8 µs per loop
    
