Vectorizing Haversine distance calculation in Python

傲寒 2020-11-29 11:15

I am trying to calculate a distance matrix for a long list of locations identified by Latitude & Longitude using the Haversine formula that takes two tuples of coordinat

3 Answers
  • 2020-11-29 11:52

    Start by generating all pairs of points with itertools.product:

    import itertools
    results = [(p1, p2, haversine(p1, p2)) for p1, p2 in itertools.product(points, repeat=2)]
    

    That said, I'm not sure how fast it will be. This looks like it might be a duplicate of "Python: speeding up geographic comparison".
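The pairing idea above can be sketched end to end. The question's own `haversine` is cut off in this thread, so the function below is an assumed stand-in that takes two (lat, lon) tuples in degrees and returns kilometres. The sample points are illustrative only.

```python
import itertools
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the constant used elsewhere in this thread

def haversine(p1, p2):
    """Assumed stand-in: great-circle distance in km between two (lat, lon) tuples in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Illustrative points: Paris, London, New York
points = [(48.8566, 2.3522), (51.5074, -0.1278), (40.7128, -74.0060)]

# Every ordered pair, including (p, p) and both (a, b) and (b, a): N * N entries
results = [(p1, p2, haversine(p1, p2)) for p1, p2 in itertools.product(points, repeat=2)]

# The distance is symmetric, so unordered pairs are enough: N * (N - 1) / 2 entries
unique_pairs = [(p1, p2, haversine(p1, p2)) for p1, p2 in itertools.combinations(points, 2)]
```

Since the matrix is symmetric with a zero diagonal, `itertools.combinations` does less than half the work of `itertools.product`, but both remain pure-Python loops.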

  • 2020-11-29 12:01

    The haversine function definition looks easy to parallelize. Using NumPy broadcasting, one of the best tools for vectorization, and replacing the math functions with their NumPy ufunc equivalents, here's one vectorized solution -

    import numpy as np
    
    # Get data as an Nx2 NumPy array of (lat, lng) pairs in degrees
    data = np.array(df['coordinates'].tolist())
    
    # Convert to radians
    data = np.deg2rad(data)
    
    # Extract columns 0 and 1 as latitudes and longitudes
    lat = data[:,0]
    lng = data[:,1]
    
    # Pairwise differences of latitudes and longitudes, each of shape (N, N)
    diff_lat = lat[:,None] - lat
    diff_lng = lng[:,None] - lng
    
    # Finally, calculate the haversine distances in km (6371 km is the Earth's mean radius)
    d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
    dist = 2 * 6371 * np.arcsin(np.sqrt(d))
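To make the broadcasting step concrete, here is a minimal self-contained sketch of the same computation on three illustrative points. The function name `haversine_matrix` and the `np.clip` guard are additions for this example and are not part of the answer's code.

```python
import numpy as np

def haversine_matrix(coords_deg):
    """Pairwise haversine distances in km for a sequence of (lat, lng) pairs in degrees."""
    data = np.deg2rad(np.asarray(coords_deg, dtype=float))
    lat, lng = data[:, 0], data[:, 1]

    # lat[:, None] has shape (N, 1) and lat has shape (N,), so the
    # subtraction broadcasts to an (N, N) matrix of all pairwise differences
    diff_lat = lat[:, None] - lat
    diff_lng = lng[:, None] - lng

    d = (np.sin(diff_lat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat) * np.sin(diff_lng / 2) ** 2)

    # Assumption for this sketch: clip guards against rounding pushing d just above 1
    return 2 * 6371 * np.arcsin(np.sqrt(np.clip(d, 0.0, 1.0)))

# Illustrative points: Paris, London, New York
coords = [(48.8566, 2.3522), (51.5074, -0.1278), (40.7128, -74.0060)]
dist = haversine_matrix(coords)
```

The result is a symmetric (N, N) array with zeros on the diagonal. Note that every intermediate array is also (N, N), so memory use grows quadratically with the number of points.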
    

    Runtime tests -

    The other answer's np.vectorize-based solution showed a promising performance improvement over the original code, so this section compares the broadcasting-based approach posted here against that one.

    Function definitions -

    def vectotized_based(df):
        haver_vec = np.vectorize(haversine, otypes=[np.int16])
        return df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
    
    def broadcasting_based(df):
        data = np.array(df['coordinates'].tolist())
        data = np.deg2rad(data)                     
        lat = data[:,0]                     
        lng = data[:,1]         
        diff_lat = lat[:,None] - lat
        diff_lng = lng[:,None] - lng
        d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
        return 2 * 6371 * np.arcsin(np.sqrt(d))
    

    Timings -

    In [123]: # Input
         ...: length = 500
         ...: d1 = np.random.uniform(-90, 90, length)
         ...: d2 = np.random.uniform(-180, 180, length)
         ...: coords = tuple(zip(d1, d2))
         ...: df = pd.DataFrame({'id':np.arange(length), 'coordinates':coords})
         ...: 
    
    In [124]: %timeit vectotized_based(df)
    1 loops, best of 3: 1.12 s per loop
    
    In [125]: %timeit broadcasting_based(df)
    10 loops, best of 3: 68.7 ms per loop
    
  • 2020-11-29 12:06

    You can wrap your function with np.vectorize() and then call the wrapped version inside a pandas groupby(...).apply(), as illustrated below:

    haver_vec = np.vectorize(haversine, otypes=[np.int16])
    distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
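One detail worth seeing in isolation is the `otypes=[np.int16]` argument, which makes the wrapped function return whole kilometres. The sketch below compares it with a `float64` output on three illustrative points. The `haversine` here is an assumed stand-in for the question's function, and the points are stored in a 1-D object array so that each element stays a tuple.

```python
import math
import numpy as np

def haversine(p1, p2):
    """Assumed stand-in: great-circle distance in km between two (lat, lon) tuples in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))

# Same wrapper as above, plus a float variant for comparison
haver_int = np.vectorize(haversine, otypes=[np.int16])
haver_float = np.vectorize(haversine, otypes=[np.float64])

# Illustrative points: Paris, London, New York
points = [(48.8566, 2.3522), (51.5074, -0.1278), (40.7128, -74.0060)]

# Fill a 1-D object array element by element so NumPy keeps the tuples intact
pts = np.empty(len(points), dtype=object)
for i, p in enumerate(points):
    pts[i] = p

# Broadcasting a (N, 1) column against a (N,) row yields the full (N, N) matrix
dist_int = haver_int(pts[:, None], pts)
dist_float = haver_float(pts[:, None], pts)
```

`np.vectorize` still calls `haversine` once per pair in Python, so it is a convenience wrapper and not compiled vectorization. The `int16` output fits here because no great-circle distance on Earth exceeds roughly 20,000 km, but the fractional part of each distance is lost.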
    

    For instance, with sample data as follows:

    length = 500
    df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})
    

    Comparison for 500 points:

    def haver_vect(data):
        distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
        return distance
    
    %timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop
    
    %timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop
    