问题
I have two arrays with lat and long. I want to calculate distance between every pair of lat and long with every other pair of lat and long in the array. Here are my two arrays.
lat_array
array([ 0.33356456, 0.33355585, 0.33355585, 0.33401788, 0.33370132,
0.33370132, 0.33370132, 0.33371075, 0.33371075, 0.33370132,
0.33370132, 0.33370132, 0.33356488, 0.33356488, 0.33370132,
0.33370132, 0.33370132, 0.33401788, 0.33362632, 0.33362632,
0.33364007, 0.33370132, 0.33401788, 0.33401788, 0.33358399,
0.33358399, 0.33358399, 0.33370132, 0.33370132, 0.33362632,
0.33370132, 0.33370132, 0.33370132, 0.33370132, 0.33370132,
0.33356488, 0.33356456, 0.33391071, 0.33370132, 0.33356488,
0.33356488, 0.33356456, 0.33356456, 0.33356456, 0.33362632,
0.33364804, 0.3336314 , 0.33370132, 0.33370132, 0.33370132,
0.33364034, 0.33359921, 0.33370132, 0.33360397, 0.33348863,
0.33370132])
long_array
array([ 1.27253229, 1.27249141, 1.27249141, 1.27259085, 1.2724337 ,
1.2724337 , 1.2724337 , 1.27246931, 1.27246931, 1.2724337 ,
1.2724337 , 1.2724337 , 1.27254305, 1.27254305, 1.2724337 ,
1.2724337 , 1.2724337 , 1.27259085, 1.27250461, 1.27250461,
1.27251211, 1.2724337 , 1.27259085, 1.27259085, 1.27252134,
1.27252134, 1.27252134, 1.2724337 , 1.2724337 , 1.27250461,
1.2724337 , 1.2724337 , 1.2724337 , 1.2724337 , 1.2724337 ,
1.27254305, 1.27253229, 1.27266808, 1.2724337 , 1.27254305,
1.27254305, 1.27253229, 1.27253229, 1.27253229, 1.27250461,
1.27250534, 1.27250184, 1.2724337 , 1.2724337 , 1.2724337 ,
1.27251339, 1.27223739, 1.2724337 , 1.2722575 , 1.27237575,
1.2724337 ])
After conversion into radians. Now I want distance between first pair of lat and long with remaining pairs of lat and long and so on. And want to print the pairs and the corresponding distance.
This is what I am doing in python.
distance = []
R = 6371.0
for i in range(len(lat_array)):
for j in (i+1,len(lat_array)):
dlon = long_array[j]-long_array[i]
dlat = lat_array[j]-lat_array[i]
a = sin(dlat / 2)**2 + cos(lat_array[i]) * cos(lat_array[j]) *
sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))
distance.append(R * c)
It gives me an error IndexError: index 56 is out of bounds for axis 0 with size 56
Where I am doing it wrong? And how to make calculation faster if the array is big? Please help.
回答1:
You have a typo in the code. Change
for j in (i+1,len(lat_array)):
to
for j in range(i+1,len(lat_array)):
Otherwise you're iterating a tuple consisting of two elements i+1
and len(lat_array)
. The second causes the error.
回答2:
Since this is currently Google's top result for "pairwise haversine distance" I'll add my two cents: This problem can be solved very quickly if you have access to scikit-learn
. When looking at sklearn.metrics.pairwise_distances you'll note that the 'haversine' metric is not supported, however it is implemented in sklearn.neighbors.DistanceMetric.
This means you can do the following:
from sklearn.neighbors import DistanceMetric
def sklearn_haversine(lat, lon):
haversine = DistanceMetric.get_metric('haversine')
latlon = np.hstack((lat[:, np.newaxis], lon[:, np.newaxis]))
dists = haversine.pairwise(latlon)
return 6371 * dists
Note that the concatenation of lat
and lon
is only necessary since they are seperate arrays. If you would pass them as a combined array of shape (n_samples, 2)
you could call haversine.pairwise
directly on them. Furthermore multiplication by 6371
is also only necessary if you need the distances in kilometers. E.g. this step wouldn't be necessary if you wanted to simply find the closest pair of points.
Verification:
In [87]: lat = np.array([ 0.33356456, 0.33355585, 0.33355585, 0.33401788, 0.33370132])
In [88]: lng = np.array([ 1.27253229, 1.27249141, 1.27249141, 1.27259085, 1.2724337 ])
In [89]: sklearn_haversine(lat, lng)
Out[89]:
array([[ 0. , 0.25227021, 0.25227021, 2.90953323, 1.05422047],
[ 0.25227021, 0. , 0. , 3.00383463, 0.98975923],
[ 0.25227021, 0. , 0. , 3.00383463, 0.98975923],
[ 2.90953323, 3.00383463, 3.00383463, 0. , 2.2276139 ],
[ 1.05422047, 0.98975923, 0.98975923, 2.2276139 , 0. ]])
Performance:
In [91]: lat = np.random.randn(1000)
In [92]: lng = np.random.randn(1000)
In [93]: %timeit original_app(lat,lng)
1 loops, best of 3: 1.46 s per loop
In [94]: %timeit vectorized_app1(lat,lng)
10 loops, best of 3: 86.7 ms per loop
In [95]: %timeit vectorized_app2(lat,lng)
10 loops, best of 3: 75.7 ms per loop
In [96]: %timeit sklearn_haversine(lat,lng)
10 loops, best of 3: 76 ms per loop
In conclusion you can get the output of Divakar's vectorized_app1
with the speed of vectorized_app2
in shorter and simpler code.
回答3:
Assuming lat
and lng
as the lattitudes & longitudes arrays and that those have data in radians, here's one vectorized solution based upon this other solution -
# Elementwise differentiations for lattitudes & longitudes
dflat = lat[:,None] - lat
dflng = lng[:,None] - lng
# Finally Calculate haversine using its distance formula
d = np.sin(dflat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(dflng/2)**2
hav_dists = 2 * 6371 * np.arcsin(np.sqrt(d))
Now, the above approach would give us output for all pairs irrespective of their order. Thus, we would have two distance outputs for the two pairs :(point1,point2)
& (point2,point1)
, even though the distances would be the same. So, to save on memory and hopefully better performance, you can create unique paired IDs with np.triu_indices and modify the earlier listed approach, like so -
# Elementwise differentiations for lattitudes & longitudes,
# but not repeat for the same paired elements
N = lat.size
idx1,idx2 = np.triu_indices(N,1)
dflat = lat[idx2] - lat[idx1]
dflng = lng[idx2] - lng[idx1]
# Finally Calculate haversine using its distance formula
d = np.sin(dflat/2)**2 + np.cos(lat[idx2])*np.cos(lat[idx1]) * np.sin(dflng/2)**2
hav_dists = 2 * 6371 * np.arcsin(np.sqrt(d))
Function definitions -
def original_app(lat,lng):
distance = []
R = 6371.0
for i in range(len(lat)):
for j in range(i+1,len(lat)):
dlon = lng[j]-lng[i]
dlat = lat[j]-lat[i]
a = sin(dlat / 2)**2 + cos(lat[i]) * cos(lat[j]) * sin(dlon / 2)**2
c = 2 * atan2(sqrt(a), sqrt(1 - a))
distance.append(R * c)
return distance
def vectorized_app1(lat,lng):
dflat = lat[:,None] - lat
dflng = lng[:,None] - lng
d = np.sin(dflat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(dflng/2)**2
return 2 * 6371 * np.arcsin(np.sqrt(d))
def vectorized_app2(lat,lng):
N = lat.size
idx1,idx2 = np.triu_indices(N,1)
dflat = lat[idx2] - lat[idx1]
dflng = lng[idx2] - lng[idx1]
d =np.sin(dflat/2)**2+np.cos(lat[idx2])*np.cos(lat[idx1])*np.sin(dflng/2)**2
return 2 * 6371 * np.arcsin(np.sqrt(d))
Verify output -
In [78]: lat
Out[78]: array([ 0.33356456, 0.33355585, 0.33355585, 0.33401788, 0.33370132])
In [79]: lng
Out[79]: array([ 1.27253229, 1.27249141, 1.27249141, 1.27259085, 1.2724337 ])
In [80]: original_app(lat,lng)
Out[80]:
[0.2522702110418014,
0.2522702110418014,
2.909533226553249,
1.0542204712876762,
0.0,
3.003834632906676,
0.9897592295963831,
3.003834632906676,
0.9897592295963831,
2.2276138997714474]
In [81]: vectorized_app1(lat,lng)
Out[81]:
array([[ 0. , 0.25227021, 0.25227021, 2.90953323, 1.05422047],
[ 0.25227021, 0. , 0. , 3.00383463, 0.98975923],
[ 0.25227021, 0. , 0. , 3.00383463, 0.98975923],
[ 2.90953323, 3.00383463, 3.00383463, 0. , 2.2276139 ],
[ 1.05422047, 0.98975923, 0.98975923, 2.2276139 , 0. ]])
In [82]: vectorized_app2(lat,lng)
Out[82]:
array([ 0.25227021, 0.25227021, 2.90953323, 1.05422047, 0. ,
3.00383463, 0.98975923, 3.00383463, 0.98975923, 2.2276139 ])
Runtime test -
In [83]: lat = np.random.randn(1000)
In [84]: lng = np.random.randn(1000)
In [85]: %timeit original_app(lat,lng)
1 loops, best of 3: 2.11 s per loop
In [86]: %timeit vectorized_app1(lat,lng)
1 loops, best of 3: 263 ms per loop
In [87]: %timeit vectorized_app2(lat,lng)
1 loops, best of 3: 224 ms per loop
Thus, for performance it seems vectorized_app2
might be the way to go!
来源:https://stackoverflow.com/questions/34557898/pairwise-haversine-distance-calculation