Vectorizing Haversine distance calculation in Python


I am trying to calculate a distance matrix for a long list of locations identified by Latitude & Longitude using the Haversine formula that takes two tuples of coordinat


3 Answers

既然无缘

2020-11-29 11:52
Start by getting all combinations using itertools.product:
```
import itertools
results = [(p1, p2, haversine(p1, p2)) for p1, p2 in itertools.product(points, repeat=2)]
```
That said, I'm not sure how fast it will be. This looks like it might be a duplicate of "Python: speeding up geographic comparison".
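
For a self-contained picture of the itertools.product approach, here is a minimal sketch. The `haversine` function below is an assumption standing in for the one in the question (two `(lat, lon)` tuples in degrees, result in km), and the sample `points` are made up for illustration.

```python
import itertools
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, the same constant used elsewhere in this thread


def haversine(p1, p2):
    """Great-circle distance in km between two (lat, lon) tuples given in degrees.

    Hypothetical stand-in for the question's function, not its exact code.
    """
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Made-up sample coordinates: equator/prime meridian, equator/90E, north pole
points = [(0.0, 0.0), (0.0, 90.0), (90.0, 0.0)]

# Every ordered pair (including each point with itself), so N*N entries
results = [(p1, p2, haversine(p1, p2))
           for p1, p2 in itertools.product(points, repeat=2)]
```

This still calls the Python function once per pair, so it grows as N squared in pure Python; it organizes the loop but does not vectorize it.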

走了就别回头了

2020-11-29 12:01

From the haversine function's definition, it looks readily parallelizable. So, using broadcasting, one of NumPy's best tools for vectorization, and replacing the math functions with their NumPy ufunc equivalents, here's one vectorized solution -

import numpy as np

# Get data as an Nx2 shaped NumPy array
data = np.array(df['coordinates'].tolist())

# Convert to radians
data = np.deg2rad(data)

# Extract col-1 and 2 as latitudes and longitudes
lat = data[:,0]
lng = data[:,1]

# Elementwise differences for latitudes & longitudes (NxN via broadcasting)
diff_lat = lat[:,None] - lat
diff_lng = lng[:,None] - lng

# Finally calculate haversine (6371 is the Earth's mean radius in km)
d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
dist = 2 * 6371 * np.arcsin(np.sqrt(d))
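
As a sanity check on the broadcasting steps above, the sketch below wraps them in a function that takes a plain (N, 2) array and compares the result against a scalar double loop. The names `haversine_matrix` and `haversine_scalar` and the sample coordinates are assumptions introduced for illustration, not part of the original answer.

```python
import math

import numpy as np


def haversine_matrix(coords_deg):
    """Pairwise distances in km for an (N, 2) array of (lat, lon) in degrees."""
    data = np.deg2rad(np.asarray(coords_deg, dtype=float))
    lat, lng = data[:, 0], data[:, 1]
    # lat[:, None] has shape (N, 1); subtracting the (N,) array broadcasts to (N, N)
    diff_lat = lat[:, None] - lat
    diff_lng = lng[:, None] - lng
    d = (np.sin(diff_lat / 2) ** 2
         + np.cos(lat[:, None]) * np.cos(lat) * np.sin(diff_lng / 2) ** 2)
    return 2 * 6371 * np.arcsin(np.sqrt(d))


def haversine_scalar(p1, p2):
    """Scalar reference implementation (hypothetical helper for the comparison)."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))


# Illustrative sample coordinates as (lat, lon) in degrees
coords = [(52.52, 13.405), (48.8566, 2.3522), (40.7128, -74.0060)]

matrix = haversine_matrix(coords)
expected = np.array([[haversine_scalar(p, q) for q in coords] for p in coords])
```

The result is a full N x N matrix, so memory grows quadratically with the number of points; the matrix is symmetric with zeros on the diagonal.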

Runtime tests -

The other np.vectorize-based solution showed a promising performance improvement over the original code, so this section compares the broadcasting-based approach posted here against that one.

Function definitions -

def vectotized_based(df):
    haver_vec = np.vectorize(haversine, otypes=[np.int16])
    return df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))

def broadcasting_based(df):
    data = np.array(df['coordinates'].tolist())
    data = np.deg2rad(data)                     
    lat = data[:,0]                     
    lng = data[:,1]         
    diff_lat = lat[:,None] - lat
    diff_lng = lng[:,None] - lng
    d = np.sin(diff_lat/2)**2 + np.cos(lat[:,None])*np.cos(lat) * np.sin(diff_lng/2)**2
    return 2 * 6371 * np.arcsin(np.sqrt(d))

Timings -

In [123]: # Input
     ...: length = 500
     ...: d1 = np.random.uniform(-90, 90, length)
     ...: d2 = np.random.uniform(-180, 180, length)
     ...: coords = tuple(zip(d1, d2))
     ...: df = pd.DataFrame({'id':np.arange(length), 'coordinates':coords})
     ...: 

In [124]: %timeit vectotized_based(df)
1 loops, best of 3: 1.12 s per loop

In [125]: %timeit broadcasting_based(df)
10 loops, best of 3: 68.7 ms per loop


没有蜡笔的小新

2020-11-29 12:06

You can pass your function to np.vectorize() and then call the vectorized version inside the function given to df.groupby(...).apply, as illustrated below:

haver_vec = np.vectorize(haversine, otypes=[np.int16])
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))

For instance, with sample data as follows:

length = 500
df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})

Comparison for 500 points:

def haver_vect(data):
    distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
    return distance

%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop

%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop
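
To show how np.vectorize behaves on its own, here is a minimal sketch of a variant that skips pandas entirely. It is not the exact code above: `haversine_scalar` is a hypothetical four-argument function (latitudes and longitudes passed separately rather than as tuples), and `otypes=[float]` is used so distances are not truncated to integers. Note that NumPy documents np.vectorize as a convenience that is essentially a for loop, which fits the gap between these timings and a broadcasting-based version.

```python
import math

import numpy as np


def haversine_scalar(lat1, lon1, lat2, lon2):
    """Distance in km between two points given in degrees (hypothetical helper)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(a))


# np.vectorize broadcasts its inputs and calls the Python function per element
haver_vec = np.vectorize(haversine_scalar, otypes=[float])

# Made-up sample points: equator/prime meridian, equator/90E, north pole
lat = np.array([0.0, 0.0, 90.0])
lon = np.array([0.0, 90.0, 0.0])

# Column vectors against row vectors broadcast to a full (N, N) distance matrix
dist = haver_vec(lat[:, None], lon[:, None], lat, lon)
```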
