Fast Haversine Approximation (Python/Pandas)

匿名 (未验证) 提交于 2019-12-03 02:11:02

问题:

Each row in a Pandas dataframe contains lat/lng coordinates of 2 points. Using the Python code below, calculating the distances between these 2 points for many (millions) of rows takes a very long time!

Considering that the 2 points are under 50 miles apart and accuracy is not very important, is it possible to make the calculation faster?

from math import radians, cos, sin, asin, sqrt def haversine(lon1, lat1, lon2, lat2):     """     Calculate the great circle distance between two points      on the earth (specified in decimal degrees)     """     # convert decimal degrees to radians      lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])     # haversine formula      dlon = lon2 - lon1      dlat = lat2 - lat1      a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2     c = 2 * asin(sqrt(a))      km = 6367 * c     return km   for index, row in df.iterrows():     df.loc[index, 'distance'] = haversine(row['a_longitude'], row['a_latitude'], row['b_longitude'], row['b_latitude']) 

回答1:

Here is a vectorized numpy version of the same function:

import numpy as np  def haversine_np(lon1, lat1, lon2, lat2):     """     Calculate the great circle distance between two points     on the earth (specified in decimal degrees)      All args must be of equal length.          """     lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])      dlon = lon2 - lon1     dlat = lat2 - lat1      a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2      c = 2 * np.arcsin(np.sqrt(a))     km = 6367 * c     return km 

The inputs are all arrays of values, and it should be able to do millions of points instantly. The requirement is that the inputs are ndarrays but the columns of your pandas table will work.

For example, with randomly generated values:

>>> import numpy as np >>> import pandas >>> lon1, lon2, lat1, lat2 = np.random.randn(4, 1000000) >>> df = pandas.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2}) >>> km = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2']) 

Looping through arrays of data is very slow in python. Numpy provides functions that operate on entire arrays of data, which lets you avoid looping and drastically improve performance.

This is an example of vectorization.



回答2:

Purely for the sake of an illustrative example, I took the numpy version in the answer from @ballsdotballs and also made a companion C implementation to be called via ctypes. Since numpy is such a highly optimized tool, there is little chance that my C code will be as efficient, but it should be somewhat close. The big advantage here is that by running through an example with C types, it can help you see how you can connect up your own personal C functions to Python without too much overhead. This is especially nice when you just want to optimize a small piece of a bigger computation by writing that small piece in some C source rather than Python. Simply using numpy will solve the problem most of the time, but for those cases when you don't really need all of numpy and you don't want to add the coupling to require use of numpy data types throughout some code, it's very handy to know how to drop down to the built-in ctypes library and do it yourself.

First let's create our C source file, called haversine.c:

#include <stdlib.h> #include <stdio.h> #include <math.h>  int haversine(size_t n,                double *lon1,                double *lat1,                double *lon2,                double *lat2,               double *kms){      if (   lon1 == NULL          || lon2 == NULL          || lat1 == NULL          || lat2 == NULL         || kms == NULL){         return -1;     }      double km, dlon, dlat;     double iter_lon1, iter_lon2, iter_lat1, iter_lat2;      double km_conversion = 2.0 * 6367.0;      double degrees2radians = 3.14159/180.0;      int i;     for(i=0; i < n; i++){         iter_lon1 = lon1[i] * degrees2radians;         iter_lat1 = lat1[i] * degrees2radians;         iter_lon2 = lon2[i] * degrees2radians;         iter_lat2 = lat2[i] * degrees2radians;          dlon = iter_lon2 - iter_lon1;         dlat = iter_lat2 - iter_lat1;          km = pow(sin(dlat/2.0), 2.0)             + cos(iter_lat1) * cos(iter_lat2) * pow(sin(dlon/2.0), 2.0);          kms[i] = km_conversion * asin(sqrt(km));     }      return 0; }  // main function for testing int main(void) {     double lat1[2] = {16.8, 27.4};     double lon1[2] = {8.44, 1.23};     double lat2[2] = {33.5, 20.07};     double lon2[2] = {14.88, 3.05};     double kms[2]  = {0.0, 0.0};     size_t arr_size = 2;      int res;     res = haversine(arr_size, lon1, lat1, lon2, lat2, kms);     printf("%d\n", res);      int i;     for (i=0; i < arr_size; i++){         printf("%3.3f, ", kms[i]);     }     printf("\n"); } 

Note that we're trying to keep with C conventions. Explicitly passing data arguments by reference, using size_t for a size variable, and expecting our haversine function to work by mutating one of the passed inputs such that it will contain the expected data on exit. The function actually returns an integer, which is a success/failure flag that could be used by other C-level consumers of the function.

We're going to need to find a way to handle all of these little C-specific issues inside of Python.

Next let's put our numpy version of the function along with some imports and some test data into a file called haversine.py:

import time import ctypes import numpy as np from math import radians, cos, sin, asin, sqrt  def haversine(lon1, lat1, lon2, lat2):     """     Calculate the great circle distance between two points      on the earth (specified in decimal degrees)     """     # convert decimal degrees to radians      lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])     # haversine formula      dlon = lon2 - lon1      dlat = lat2 - lat1      a = (np.sin(dlat/2)**2           + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)     c = 2 * np.arcsin(np.sqrt(a))      km = 6367 * c     return km  if __name__ == "__main__":     lat1 = 50.0 * np.random.rand(1000000)     lon1 = 50.0 * np.random.rand(1000000)     lat2 = 50.0 * np.random.rand(1000000)     lon2 = 50.0 * np.random.rand(1000000)      t0 = time.time()     r1 = haversine(lon1, lat1, lon2, lat2)     t1 = time.time()     print t1-t0, r1 

I chose to make lats and lons (in degrees) that are randomly chosen between 0 and 50, but it doesn't matter too much for this explanation.

The next thing we need to do is to compile our C module in such a way that it can be dynamically loaded by Python. I'm using a Linux system (you can find examples for other systems very easily on Google), so my goal is to compile haversine.c into a shared object, like so:

gcc -shared -o haversine.so -fPIC haversine.c -lm 

We can also compile to an executable and run it to see what the C program's main function displays:

> gcc haversine.c -o haversine -lm > ./haversine 0 1964.322, 835.278,  

Now that we have compiled the shared object haversine.so, we can use ctypes to load it in Python and we need to supply the path to the file to do so:

lib_path = "/path/to/haversine.so" # Obviously use your real path here. haversine_lib = ctypes.CDLL(lib_path) 

Now haversine_lib.haversine acts pretty much just like a Python function, except that we might need to do some manual type marshaling to make sure the inputs and outputs are interpreted correctly.

numpy actually provides some nice tools for this and the one I'll use here is numpy.ctypeslib. We're going to build a pointer type that will allow us to pass around numpy.ndarrays to these ctypes-loaded functions as through they were pointers. Here's the code:

arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double,                                         ndim=1,                                         flags='CONTIGUOUS')  haversine_lib.haversine.restype = ctypes.c_int haversine_lib.haversine.argtypes = [ctypes.c_size_t,                                     arr_1d_double,                                      arr_1d_double,                                     arr_1d_double,                                     arr_1d_double,                                     arr_1d_double]  

Notice that we tell the haversine_lib.haversine function proxy to interpret its arguments according to the types we want.

Now, to test it out from Python what remains is to just make a size variable, and an array that will be mutated (just like in the C code) to contain the result data, then we can call it:

size = len(lat1) output = np.empty(size, dtype=np.double) print "=====" print output t2 = time.time() res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output) t3 = time.time() print t3 - t2, res print type(output), output 

Putting it all together in the __main__ block of haversine.py, the whole file now looks like this:

import time import ctypes import numpy as np from math import radians, cos, sin, asin, sqrt  def haversine(lon1, lat1, lon2, lat2):     """     Calculate the great circle distance between two points      on the earth (specified in decimal degrees)     """     # convert decimal degrees to radians      lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])     # haversine formula      dlon = lon2 - lon1      dlat = lat2 - lat1      a = (np.sin(dlat/2)**2           + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)     c = 2 * np.arcsin(np.sqrt(a))      km = 6367 * c     return km  if __name__ == "__main__":     lat1 = 50.0 * np.random.rand(1000000)     lon1 = 50.0 * np.random.rand(1000000)     lat2 = 50.0 * np.random.rand(1000000)     lon2 = 50.0 * np.random.rand(1000000)      t0 = time.time()     r1 = haversine(lon1, lat1, lon2, lat2)     t1 = time.time()     print t1-t0, r1      lib_path = "/home/ely/programming/python/numpy_ctypes/haversine.so"     haversine_lib = ctypes.CDLL(lib_path)     arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double,                                             ndim=1,                                             flags='CONTIGUOUS')      haversine_lib.haversine.restype = ctypes.c_int     haversine_lib.haversine.argtypes = [ctypes.c_size_t,                                         arr_1d_double,                                          arr_1d_double,                                         arr_1d_double,                                         arr_1d_double,                                         arr_1d_double]      size = len(lat1)     output = np.empty(size, dtype=np.double)     print "====="     print output     t2 = time.time()     res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output)     t3 = time.time()     print t3 - t2, res     print type(output), output 

To run it, which will run and time the Python and ctypes versions separately and print some results, we can just do

python haversine.py 

which displays:

0.111340045929 [  231.53695005  3042.84915093   169.5158946  ...,  1359.2656769   2686.87895954  3728.54788207] ===== [  6.92017600e-310   2.97780954e-316   2.97780954e-316 ...,    3.20676686e-001   1.31978329e-001   5.15819721e-001] 0.148446083069 0 <type 'numpy.ndarray'> [  231.53675618  3042.84723579   169.51575588 ...,  1359.26453029   2686.87709456  3728.54493339] 

As expected, the numpy version is slightly faster (0.11 seconds for vectors with length of 1 million) but our quick and dirty ctypes version is no slouch: a respectable 0.148 seconds on the same data.

Let's compare this to a naive for-loop solution in Python:

from math import radians, cos, sin, asin, sqrt  def slow_haversine(lon1, lat1, lon2, lat2):     n = len(lon1)     kms = np.empty(n, dtype=np.double)     for i in range(n):        lon1_v, lat1_v, lon2_v, lat2_v = map(            radians,             [lon1[i], lat1[i], lon2[i], lat2[i]]        )         dlon = lon2_v - lon1_v         dlat = lat2_v - lat1_v         a = (sin(dlat/2)**2              + cos(lat1_v) * cos(lat2_v) * sin(dlon/2)**2)        c = 2 * asin(sqrt(a))         kms[i] = 6367 * c     return kms 

When I put this into the same Python file as the others and time it on the same million-element data, I consistently see a time of around 2.65 seconds on my machine.

So by quickly switching to ctypes we improve the speed by a factor of about 18. For many calculations that can benefit from access to bare, contiguous data, you often see gains much higher even than this.

Just to be super clear, I am not at all endorsing this as a better option than just using numpy. This is precisely the problem that numpy was built to solve, and so homebrewing your own ctypes code whenever it both (a) makes sense to incorporate numpy data types in your application and (b) there exists an easy way to map your code into a numpy equivalent, is not very efficient.

But it's still very helpful to know how to do this for those occasions when you prefer to write something in C yet call it in Python, or situations where a dependence on numpy is not practical (in an embedded system where numpy cannot be installed, for example).



回答3:

In case that using scikit-learn is allowed, I would give the following a chance:

from sklearn.neighbors import DistanceMetric dist = DistanceMetric.get_metric('haversine')  # example data lat1, lon1 = 36.4256345, -5.1510261 lat2, lon2 = 40.4165, -3.7026 lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])  X = [[lat1, lon1],      [lat2, lon2]] kms = 6367 print(kms * dist.pairwise(X)) 


回答4:

Some of these answers are "rounding" the radius of the earth. If you check these against other distance calculators (such as geopy), these functions will be off.

You can switch out R=3959.87433 for the conversion constant below if you want the answer in miles.

If you want kilometers, use R= 6372.8.

lon1 = -103.548851 lat1 = 32.0004311 lon2 = -103.6041946 lat2 = 33.374939   def haversine(lat1, lon1, lat2, lon2):        R = 3959.87433 # this is in miles.  For Earth radius in kilometers use 6372.8 km        dLat = radians(lat2 - lat1)       dLon = radians(lon2 - lon1)       lat1 = radians(lat1)       lat2 = radians(lat2)        a = sin(dLat/2)**2 + cos(lat1)*cos(lat2)*sin(dLon/2)**2       c = 2*asin(sqrt(a))        return R * c  print(haversine(lat1, lon1, lat2, lon2)) 


易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!