Each row in a Pandas dataframe contains the lat/lng coordinates of 2 points. Using the Python code below, calculating the distances between these 2 points for many (millions of) rows takes a very long time!
Considering that the 2 points are under 50 miles apart and accuracy is not very important, is it possible to make the calculation faster?
    from math import radians, cos, sin, asin, sqrt

    def haversine(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a))
        km = 6367 * c
        return km


    for index, row in df.iterrows():
        df.loc[index, 'distance'] = haversine(row['a_longitude'], row['a_latitude'], row['b_longitude'], row['b_latitude'])
Here is a vectorized numpy version of the same function:
    import numpy as np

    def haversine_np(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)

        All args must be of equal length.
        """
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

        dlon = lon2 - lon1
        dlat = lat2 - lat1

        a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

        c = 2 * np.arcsin(np.sqrt(a))
        km = 6367 * c
        return km
The inputs are all arrays of values, and it should be able to handle millions of points almost instantly. The function expects ndarrays as inputs, but the columns of your pandas table will work as well.
For example, with randomly generated values:
    >>> import numpy as np
    >>> import pandas
    >>> lon1, lon2, lat1, lat2 = np.random.randn(4, 1000000)
    >>> df = pandas.DataFrame(data={'lon1':lon1,'lon2':lon2,'lat1':lat1,'lat2':lat2})
    >>> km = haversine_np(df['lon1'],df['lat1'],df['lon2'],df['lat2'])
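Applied to the DataFrame from the question, the row-by-row iterrows loop collapses into a single column assignment. This is a minimal sketch, assuming the column names used in the question (a_longitude, a_latitude, b_longitude, b_latitude); the three sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd


def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; inputs are array-likes in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))


# Made-up sample rows, using the column names from the question.
df = pd.DataFrame({
    "a_longitude": [-74.00, -87.63, 2.35],
    "a_latitude": [40.71, 41.88, 48.86],
    "b_longitude": [-73.95, -87.70, 2.35],
    "b_latitude": [40.78, 41.95, 48.86],
})

# One vectorized call replaces the whole iterrows() loop.
df["distance"] = haversine_np(
    df["a_longitude"], df["a_latitude"], df["b_longitude"], df["b_latitude"]
)
print(df)
```

The third row uses identical points on purpose, so its distance comes out as zero, which is a quick sanity check on the column order.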
Looping through arrays of data is very slow in Python. NumPy provides functions that operate on entire arrays of data at once, which lets you avoid looping and drastically improves performance.
This is an example of vectorization.
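To make the idea concrete, here is a small sketch (the variable names are illustrative, not from the code above) that computes the same quantity twice: once with an explicit Python loop and once with a single array expression.

```python
import math

import numpy as np

angles = np.linspace(0.0, math.pi, 5)

# Loop version: one Python-level function call per element.
looped = np.empty_like(angles)
for i, x in enumerate(angles):
    looped[i] = math.sin(x) ** 2

# Vectorized version: a single call; the loop runs inside NumPy's compiled code.
vectorized = np.sin(angles) ** 2

print(np.allclose(looped, vectorized))
```

Both versions produce the same values. The difference is only in where the loop runs, and that is what matters once the array has millions of elements.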
Purely for the sake of an illustrative example, I took the numpy version in the answer from @ballsdotballs and also made a companion C implementation to be called via ctypes. Since numpy is such a highly optimized tool, there is little chance that my C code will be as efficient, but it should be somewhat close. The big advantage here is that running through an example with C types can help you see how to connect your own C functions to Python without too much overhead. This is especially nice when you just want to optimize a small piece of a bigger computation by writing that small piece in C rather than Python. Simply using numpy will solve the problem most of the time, but for those cases when you don't really need all of numpy and you don't want to couple your code to numpy data types throughout, it's very handy to know how to drop down to the built-in ctypes library and do it yourself.
First let's create our C source file, called haversine.c:
    #include <stdlib.h>
    #include <stdio.h>
    #include <math.h>

    int haversine(size_t n,
                  double *lon1,
                  double *lat1,
                  double *lon2,
                  double *lat2,
                  double *kms){

        if (   lon1 == NULL
            || lon2 == NULL
            || lat1 == NULL
            || lat2 == NULL
            || kms == NULL){
            return -1;
        }

        double km, dlon, dlat;
        double iter_lon1, iter_lon2, iter_lat1, iter_lat2;

        double km_conversion = 2.0 * 6367.0;
        /* pi is truncated to six significant digits here */
        double degrees2radians = 3.14159/180.0;

        /* size_t index to match the type of n */
        size_t i;
        for(i=0; i < n; i++){
            iter_lon1 = lon1[i] * degrees2radians;
            iter_lat1 = lat1[i] * degrees2radians;
            iter_lon2 = lon2[i] * degrees2radians;
            iter_lat2 = lat2[i] * degrees2radians;

            dlon = iter_lon2 - iter_lon1;
            dlat = iter_lat2 - iter_lat1;

            km = pow(sin(dlat/2.0), 2.0)
               + cos(iter_lat1) * cos(iter_lat2) * pow(sin(dlon/2.0), 2.0);

            kms[i] = km_conversion * asin(sqrt(km));
        }

        return 0;
    }

    // main function for testing
    int main(void) {
        double lat1[2] = {16.8, 27.4};
        double lon1[2] = {8.44, 1.23};
        double lat2[2] = {33.5, 20.07};
        double lon2[2] = {14.88, 3.05};
        double kms[2]  = {0.0, 0.0};
        size_t arr_size = 2;

        int res;
        res = haversine(arr_size, lon1, lat1, lon2, lat2, kms);
        printf("%d\n", res);

        size_t i;
        for (i=0; i < arr_size; i++){
            printf("%3.3f, ", kms[i]);
        }
        printf("\n");
        return 0;
    }
Note that we're trying to keep with C conventions: explicitly passing data arguments by reference, using size_t for a size variable, and expecting our haversine function to work by mutating one of the passed inputs so that it contains the expected data on exit. The function itself returns an integer, which is a success/failure flag that could be used by other C-level consumers of the function.
We're going to need to find a way to handle all of these little C-specific issues inside of Python.
Next let's put our numpy version of the function, along with some imports and some test data, into a file called haversine.py:
    import time
    import ctypes
    import numpy as np
    from math import radians, cos, sin, asin, sqrt

    def haversine(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = (np.sin(dlat/2)**2
             + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)
        c = 2 * np.arcsin(np.sqrt(a))
        km = 6367 * c
        return km

    if __name__ == "__main__":
        lat1 = 50.0 * np.random.rand(1000000)
        lon1 = 50.0 * np.random.rand(1000000)
        lat2 = 50.0 * np.random.rand(1000000)
        lon2 = 50.0 * np.random.rand(1000000)

        t0 = time.time()
        r1 = haversine(lon1, lat1, lon2, lat2)
        t1 = time.time()
        print t1-t0, r1
I chose to make lats and lons (in degrees) that are randomly chosen between 0 and 50, but it doesn't matter too much for this explanation.
The next thing we need to do is to compile our C module in such a way that it can be dynamically loaded by Python. I'm using a Linux system (you can find examples for other systems very easily on Google), so my goal is to compile haversine.c into a shared object, like so:
    gcc -shared -o haversine.so -fPIC haversine.c -lm
We can also compile to an executable and run it to see what the C program's main function displays:
    > gcc haversine.c -o haversine -lm
    > ./haversine
    0
    1964.322, 835.278,
Now that we have compiled the shared object haversine.so, we can use ctypes to load it in Python. To do so, we need to supply the path to the file:
    lib_path = "/path/to/haversine.so" # Obviously use your real path here.
    haversine_lib = ctypes.CDLL(lib_path)
Now haversine_lib.haversine acts pretty much just like a Python function, except that we might need to do some manual type marshaling to make sure the inputs and outputs are interpreted correctly.
numpy actually provides some nice tools for this, and the one I'll use here is numpy.ctypeslib. We're going to build a pointer type that will allow us to pass numpy.ndarrays to these ctypes-loaded functions as though they were pointers. Here's the code:
    arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double,
                                           ndim=1,
                                           flags='CONTIGUOUS')

    haversine_lib.haversine.restype = ctypes.c_int
    haversine_lib.haversine.argtypes = [ctypes.c_size_t,
                                        arr_1d_double,
                                        arr_1d_double,
                                        arr_1d_double,
                                        arr_1d_double,
                                        arr_1d_double]
Notice that we tell the haversine_lib.haversine function proxy to interpret its arguments according to the types we want.
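One practical consequence of declaring argtypes this way is that an array of the wrong dtype or layout is rejected with a TypeError before the C code ever sees it. The sketch below exercises that check directly through the pointer type's from_param method, so it runs without the compiled library; the is_accepted helper is a hypothetical name introduced only for this illustration.

```python
import numpy as np

# Same pointer type as in the text: 1-D, float64, contiguous.
arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double, ndim=1, flags='CONTIGUOUS')


def is_accepted(arr):
    """Return True if arr passes the ndpointer check, False if it is rejected."""
    try:
        arr_1d_double.from_param(arr)
        return True
    except TypeError:
        return False


good = np.zeros(4, dtype=np.double)
wrong_dtype = np.zeros(4, dtype=np.float32)
strided = np.zeros(8, dtype=np.double)[::2]  # a non-contiguous view

print(is_accepted(good))
print(is_accepted(wrong_dtype))
print(is_accepted(strided))
print(is_accepted(np.ascontiguousarray(strided)))
```

If your data arrives as a slice or some other view, np.ascontiguousarray gives you an array that satisfies the CONTIGUOUS flag, copying only when it has to.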
Now, to test it out from Python, all that remains is to make a size variable and an array that will be mutated (just like in the C code) to contain the result data. Then we can call it:
    size = len(lat1)
    output = np.empty(size, dtype=np.double)
    print "====="
    print output
    t2 = time.time()
    res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output)
    t3 = time.time()
    print t3 - t2, res
    print type(output), output
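If you call the C function from more than one place, it may be worth hiding the output buffer and the status flag behind a small Python helper. The following is only a sketch: haversine_c is a hypothetical name, and fake_c_haversine is a pure-Python stand-in with the same calling convention, used so the example runs without the compiled library. In real use you would pass haversine_lib.haversine instead.

```python
import numpy as np


def haversine_c(c_func, lon1, lat1, lon2, lat2):
    """Call a C-style haversine(n, lon1, lat1, lon2, lat2, out) and return the result.

    c_func is assumed to follow the convention of haversine.c: it fills
    `out` in place and returns 0 on success, nonzero on failure.
    """
    # Coerce inputs to contiguous float64 arrays, as the ndpointer type requires.
    arrays = [np.ascontiguousarray(a, dtype=np.double)
              for a in (lon1, lat1, lon2, lat2)]
    size = len(arrays[0])
    if any(len(a) != size for a in arrays):
        raise ValueError("all inputs must have the same length")
    output = np.empty(size, dtype=np.double)
    status = c_func(size, arrays[0], arrays[1], arrays[2], arrays[3], output)
    if status != 0:
        raise RuntimeError("haversine failed with status %d" % status)
    return output


def fake_c_haversine(n, lon1, lat1, lon2, lat2, out):
    """Pure-Python stand-in for haversine_lib.haversine, for demonstration only."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    out[:n] = 2.0 * 6367.0 * np.arcsin(np.sqrt(a))
    return 0


# Same test points as the main function in haversine.c.
kms = haversine_c(fake_c_haversine,
                  [8.44, 1.23], [16.8, 27.4], [14.88, 3.05], [33.5, 20.07])
print(kms)
```

The caller gets back an ordinary array and an exception on failure, which is closer to what Python code normally expects than a mutated argument plus an integer flag.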
Putting it all together in the __main__ block of haversine.py, the whole file now looks like this:
    import time
    import ctypes
    import numpy as np
    from math import radians, cos, sin, asin, sqrt

    def haversine(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = (np.sin(dlat/2)**2
             + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2)
        c = 2 * np.arcsin(np.sqrt(a))
        km = 6367 * c
        return km

    if __name__ == "__main__":
        lat1 = 50.0 * np.random.rand(1000000)
        lon1 = 50.0 * np.random.rand(1000000)
        lat2 = 50.0 * np.random.rand(1000000)
        lon2 = 50.0 * np.random.rand(1000000)

        t0 = time.time()
        r1 = haversine(lon1, lat1, lon2, lat2)
        t1 = time.time()
        print t1-t0, r1

        lib_path = "/home/ely/programming/python/numpy_ctypes/haversine.so"
        haversine_lib = ctypes.CDLL(lib_path)
        arr_1d_double = np.ctypeslib.ndpointer(dtype=np.double,
                                               ndim=1,
                                               flags='CONTIGUOUS')

        haversine_lib.haversine.restype = ctypes.c_int
        haversine_lib.haversine.argtypes = [ctypes.c_size_t,
                                            arr_1d_double,
                                            arr_1d_double,
                                            arr_1d_double,
                                            arr_1d_double,
                                            arr_1d_double]

        size = len(lat1)
        output = np.empty(size, dtype=np.double)
        print "====="
        print output
        t2 = time.time()
        res = haversine_lib.haversine(size, lon1, lat1, lon2, lat2, output)
        t3 = time.time()
        print t3 - t2, res
        print type(output), output
Running the file times the Python and ctypes versions separately and prints some results. To run it, we can just do
    python haversine.py
which displays:
    0.111340045929 [ 231.53695005 3042.84915093 169.5158946 ..., 1359.2656769 2686.87895954 3728.54788207]
    =====
    [ 6.92017600e-310 2.97780954e-316 2.97780954e-316 ..., 3.20676686e-001 1.31978329e-001 5.15819721e-001]
    0.148446083069 0
    <type 'numpy.ndarray'> [ 231.53675618 3042.84723579 169.51575588 ..., 1359.26453029 2686.87709456 3728.54493339]
As expected, the numpy version is slightly faster (0.11 seconds for vectors with length of 1 million) but our quick and dirty ctypes version is no slouch: a respectable 0.148 seconds on the same data.
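The two result arrays printed above agree only to about six significant digits. One plausible explanation is that haversine.c converts degrees using the truncated constant 3.14159, while np.radians works with pi at full double precision. The sketch below reproduces the effect in numpy alone for the first test point of the C main function; haversine_with and its deg2rad parameter are names introduced here for illustration.

```python
import numpy as np


def haversine_with(deg2rad, lon1, lat1, lon2, lat2):
    """Haversine distance in km using an explicit degrees-to-radians factor."""
    lon1, lat1, lon2, lat2 = [v * deg2rad for v in (lon1, lat1, lon2, lat2)]
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2.0 * 6367.0 * np.arcsin(np.sqrt(a))


# First test point from the main function in haversine.c.
point = (8.44, 16.8, 14.88, 33.5)

full = haversine_with(np.pi / 180.0, *point)        # what np.radians does
truncated = haversine_with(3.14159 / 180.0, *point)  # what haversine.c does

relative_diff = abs(full - truncated) / full
print(full, truncated, relative_diff)
```

The relative difference comes out just under one part per million, the same order as the gap between the two printed arrays. Using M_PI from math.h in the C source would be the obvious way to close it, assuming your platform's math.h defines it.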
Let's compare this to a naive for-loop solution in Python:
    from math import radians, cos, sin, asin, sqrt

    def slow_haversine(lon1, lat1, lon2, lat2):
        n = len(lon1)
        kms = np.empty(n, dtype=np.double)
        for i in range(n):
            lon1_v, lat1_v, lon2_v, lat2_v = map(
                radians,
                [lon1[i], lat1[i], lon2[i], lat2[i]]
            )

            dlon = lon2_v - lon1_v
            dlat = lat2_v - lat1_v
            a = (sin(dlat/2)**2
                 + cos(lat1_v) * cos(lat2_v) * sin(dlon/2)**2)
            c = 2 * asin(sqrt(a))
            kms[i] = 6367 * c
        return kms
When I put this into the same Python file as the others and time it on the same million-element data, I consistently see a time of around 2.65 seconds on my machine.
So by quickly switching to ctypes we improve the speed by a factor of about 18. For many calculations that can benefit from access to bare, contiguous data, you often see gains much higher even than this.
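If you want to repeat this kind of comparison on your own machine, a small timing harness along these lines may help. It is a sketch rather than the exact script used for the numbers above: time_call is a hypothetical helper, the array size is reduced so the slow loop finishes quickly, and it compares the loop against the numpy version because that needs no compiled library. The timings and the resulting ratio will differ from machine to machine.

```python
import time
from math import radians, cos, sin, asin, sqrt

import numpy as np


def haversine_np(lon1, lat1, lon2, lat2):
    """Vectorized haversine distance in km."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))


def slow_haversine(lon1, lat1, lon2, lat2):
    """Naive element-by-element haversine distance in km."""
    n = len(lon1)
    kms = np.empty(n, dtype=np.double)
    for i in range(n):
        lo1, la1, lo2, la2 = map(radians, [lon1[i], lat1[i], lon2[i], lat2[i]])
        a = sin((la2 - la1) / 2) ** 2 + cos(la1) * cos(la2) * sin((lo2 - lo1) / 2) ** 2
        kms[i] = 6367 * 2 * asin(sqrt(a))
    return kms


def time_call(func, *args):
    """Return (elapsed seconds, result) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return time.perf_counter() - start, result


rng = np.random.default_rng(0)
n = 20000  # far fewer than a million, to keep the loop version quick
lon1, lat1, lon2, lat2 = 50.0 * rng.random((4, n))

t_fast, r_fast = time_call(haversine_np, lon1, lat1, lon2, lat2)
t_slow, r_slow = time_call(slow_haversine, lon1, lat1, lon2, lat2)

print("vectorized: %.5f s, loop: %.5f s" % (t_fast, t_slow))
print("speedup: %.1fx" % (t_slow / max(t_fast, 1e-12)))
print(np.allclose(r_fast, r_slow))
```

A single call is a rough measurement. For anything you intend to report, the standard library's timeit module repeats the call and reduces noise.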
Just to be super clear, I am not at all endorsing this as a better option than just using numpy. This is precisely the problem that numpy was built to solve, so homebrewing your own ctypes code is not very efficient whenever (a) it makes sense to incorporate numpy data types in your application and (b) there exists an easy way to map your code into a numpy equivalent.
But it's still very helpful to know how to do this for those occasions when you prefer to write something in C yet call it in Python, or situations where a dependence on numpy is not practical (in an embedded system where numpy cannot be installed, for example).