Pairwise haversine distance calculation

I have two arrays with lat and long. I want to calculate distance between every pair of lat and long with every other pair of lat and long in the array. Here are my two arrays.

lat_array

array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132,
        0.33370132,  0.33370132,  0.33371075,  0.33371075,  0.33370132,
        0.33370132,  0.33370132,  0.33356488,  0.33356488,  0.33370132,
        0.33370132,  0.33370132,  0.33401788,  0.33362632,  0.33362632,
        0.33364007,  0.33370132,  0.33401788,  0.33401788,  0.33358399,
        0.33358399,  0.33358399,  0.33370132,  0.33370132,  0.33362632,
        0.33370132,  0.33370132,  0.33370132,  0.33370132,  0.33370132,
        0.33356488,  0.33356456,  0.33391071,  0.33370132,  0.33356488,
        0.33356488,  0.33356456,  0.33356456,  0.33356456,  0.33362632,
        0.33364804,  0.3336314 ,  0.33370132,  0.33370132,  0.33370132,
        0.33364034,  0.33359921,  0.33370132,  0.33360397,  0.33348863,
        0.33370132])
long_array

array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ,
        1.2724337 ,  1.2724337 ,  1.27246931,  1.27246931,  1.2724337 ,
        1.2724337 ,  1.2724337 ,  1.27254305,  1.27254305,  1.2724337 ,
        1.2724337 ,  1.2724337 ,  1.27259085,  1.27250461,  1.27250461,
        1.27251211,  1.2724337 ,  1.27259085,  1.27259085,  1.27252134,
        1.27252134,  1.27252134,  1.2724337 ,  1.2724337 ,  1.27250461,
        1.2724337 ,  1.2724337 ,  1.2724337 ,  1.2724337 ,  1.2724337 ,
        1.27254305,  1.27253229,  1.27266808,  1.2724337 ,  1.27254305,
        1.27254305,  1.27253229,  1.27253229,  1.27253229,  1.27250461,
        1.27250534,  1.27250184,  1.2724337 ,  1.2724337 ,  1.2724337 ,
        1.27251339,  1.27223739,  1.2724337 ,  1.2722575 ,  1.27237575,
        1.2724337 ])

After conversion into radians. Now I want distance between first pair of lat and long with remaining pairs of lat and long and so on. And want to print the pairs and the corresponding distance.

This is what I am doing in python.

distance = []
R = 6371.0

for i in range(len(lat_array)):
   for j in (i+1,len(lat_array)):
      dlon = long_array[j]-long_array[i]
      dlat = lat_array[j]-lat_array[i]
      a = sin(dlat / 2)**2 + cos(lat_array[i]) * cos(lat_array[j]) *     
          sin(dlon / 2)**2
      c = 2 * atan2(sqrt(a), sqrt(1 - a))

      distance.append(R * c)

It gives me an error IndexError: index 56 is out of bounds for axis 0 with size 56 Where I am doing it wrong? And how to make calculation faster if the array is big? Please help.

You have a typo in the code. Change

for j in (i+1,len(lat_array)):

for j in range(i+1,len(lat_array)):

Otherwise you're iterating a tuple consisting of two elements i+1 and len(lat_array). The second causes the error.

Since this is currently Google's top result for "pairwise haversine distance" I'll add my two cents: This problem can be solved very quickly if you have access to scikit-learn. When looking at sklearn.metrics.pairwise_distances you'll note that the 'haversine' metric is not supported, however it is implemented in sklearn.neighbors.DistanceMetric.

This means you can do the following:

from sklearn.neighbors import DistanceMetric

def sklearn_haversine(lat, lon):
    haversine = DistanceMetric.get_metric('haversine')
    latlon = np.hstack((lat[:, np.newaxis], lon[:, np.newaxis]))
    dists = haversine.pairwise(latlon)
    return 6371 * dists

Note that the concatenation of lat and lon is only necessary since they are seperate arrays. If you would pass them as a combined array of shape (n_samples, 2) you could call haversine.pairwise directly on them. Furthermore multiplication by 6371 is also only necessary if you need the distances in kilometers. E.g. this step wouldn't be necessary if you wanted to simply find the closest pair of points.

Verification:

In [87]: lat = np.array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132])

In [88]: lng = np.array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ])

In [89]: sklearn_haversine(lat, lng)
Out[89]:
array([[ 0.        ,  0.25227021,  0.25227021,  2.90953323,  1.05422047],
       [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
       [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
       [ 2.90953323,  3.00383463,  3.00383463,  0.        ,  2.2276139 ],
       [ 1.05422047,  0.98975923,  0.98975923,  2.2276139 ,  0.        ]])

Performance:

In [91]: lat = np.random.randn(1000)

In [92]: lng = np.random.randn(1000)

In [93]: %timeit original_app(lat,lng)
1 loops, best of 3: 1.46 s per loop

In [94]: %timeit vectorized_app1(lat,lng)
10 loops, best of 3: 86.7 ms per loop

In [95]: %timeit vectorized_app2(lat,lng)
10 loops, best of 3: 75.7 ms per loop

In [96]: %timeit sklearn_haversine(lat,lng)
10 loops, best of 3: 76 ms per loop

In conclusion you can get the output of Divakar's vectorized_app1 with the speed of vectorized_app2 in shorter and simpler code.

Divakar

Assuming lat and lng as the lattitudes & longitudes arrays and that those have data in radians, here's one vectorized solution based upon this other solution -

# Elementwise differentiations for lattitudes & longitudes
dflat = lat[:,None] - lat
dflng = lng[:,None] - lng

# Finally Calculate haversine using its distance formula
d = np.sin(dflat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(dflng/2)**2
hav_dists = 2 * 6371 * np.arcsin(np.sqrt(d))

Now, the above approach would give us output for all pairs irrespective of their order. Thus, we would have two distance outputs for the two pairs :(point1,point2) & (point2,point1), even though the distances would be the same. So, to save on memory and hopefully better performance, you can create unique paired IDs with np.triu_indices and modify the earlier listed approach, like so -

# Elementwise differentiations for lattitudes & longitudes, 
# but not repeat for the same paired elements
N = lat.size
idx1,idx2 = np.triu_indices(N,1)
dflat = lat[idx2] - lat[idx1]
dflng = lng[idx2] - lng[idx1]

# Finally Calculate haversine using its distance formula
d = np.sin(dflat/2)**2 + np.cos(lat[idx2])*np.cos(lat[idx1]) * np.sin(dflng/2)**2
hav_dists = 2 * 6371 * np.arcsin(np.sqrt(d))

Function definitions -

def original_app(lat,lng):
    distance = []
    R = 6371.0
    for i in range(len(lat)):
       for j in range(i+1,len(lat)):
          dlon = lng[j]-lng[i]
          dlat = lat[j]-lat[i]
          a = sin(dlat / 2)**2 + cos(lat[i]) * cos(lat[j]) * sin(dlon / 2)**2
          c = 2 * atan2(sqrt(a), sqrt(1 - a))
          distance.append(R * c)
    return distance

def vectorized_app1(lat,lng):                               
    dflat = lat[:,None] - lat
    dflng = lng[:,None] - lng
    d = np.sin(dflat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(dflng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

def vectorized_app2(lat,lng):                               
    N = lat.size
    idx1,idx2 = np.triu_indices(N,1)
    dflat = lat[idx2] - lat[idx1]
    dflng = lng[idx2] - lng[idx1]
    d =np.sin(dflat/2)**2+np.cos(lat[idx2])*np.cos(lat[idx1])*np.sin(dflng/2)**2
    return  2 * 6371 * np.arcsin(np.sqrt(d))

Verify output -

In [78]: lat
Out[78]: array([ 0.33356456,  0.33355585,  0.33355585,  0.33401788,  0.33370132])

In [79]: lng
Out[79]: array([ 1.27253229,  1.27249141,  1.27249141,  1.27259085,  1.2724337 ])

In [80]: original_app(lat,lng)
Out[80]: 
[0.2522702110418014,
 0.2522702110418014,
 2.909533226553249,
 1.0542204712876762,
 0.0,
 3.003834632906676,
 0.9897592295963831,
 3.003834632906676,
 0.9897592295963831,
 2.2276138997714474]

In [81]: vectorized_app1(lat,lng)
Out[81]: 
array([[ 0.        ,  0.25227021,  0.25227021,  2.90953323,  1.05422047],
       [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
       [ 0.25227021,  0.        ,  0.        ,  3.00383463,  0.98975923],
       [ 2.90953323,  3.00383463,  3.00383463,  0.        ,  2.2276139 ],
       [ 1.05422047,  0.98975923,  0.98975923,  2.2276139 ,  0.        ]])

In [82]: vectorized_app2(lat,lng)
Out[82]: 
array([ 0.25227021,  0.25227021,  2.90953323,  1.05422047,  0.        ,
        3.00383463,  0.98975923,  3.00383463,  0.98975923,  2.2276139 ])

Runtime test -

In [83]: lat = np.random.randn(1000)

In [84]: lng = np.random.randn(1000)

In [85]: %timeit original_app(lat,lng)
1 loops, best of 3: 2.11 s per loop

In [86]: %timeit vectorized_app1(lat,lng)
1 loops, best of 3: 263 ms per loop

In [87]: %timeit vectorized_app2(lat,lng)
1 loops, best of 3: 224 ms per loop

Thus, for performance it seems vectorized_app2 might be the way to go!

来源：https://stackoverflow.com/questions/34557898/pairwise-haversine-distance-calculation

标签

python

arrays

performance

numpy

haversine