I\'ve got millions of geographic points. For each one of these, I want to find all \"neighboring points,\" i.e., all other points within some radius, say a few hundred meters.<
Tipped off by Eamon, I've come up with a simple solution using btrees implemented in SciPy.
from scipy.spatial import cKDTree
from scipy import inf
max_distance = 0.0001 # Assuming lats and longs are in decimal degrees, this corresponds to 11.1 meters
points = [(lat1, long1), (lat2, long2) ... ]
tree = cKDTree(points)
point_neighbors_list = [] # Put the neighbors of each point here
for point in points:
distances, indices = tree.query(point, len(points), p=2, distance_upper_bound=max_distance)
point_neighbors = []
for index, distance in zip(indices, distances):
if distance == inf:
break
point_neighbors.append(points[index])
point_neighbors_list.append(point_neighbors)
First things first: there are preexisting algorithms to do things kind of thing, such as the k-d tree. Scipy has a python implementation cKDtree that can find all points in a given range.
Depending on what you're doing however, implementing something like that may be nontrivial. Furthermore, creating a tree is fairly complex (potentially quite a bit of overhead), and you may be able to get away with a simple hack I've used before:
Effectively, you're doing O(N log(N)) preprocessing, and for each point roughly o(sqrt(N)) - or more, if the distribution of your points is poor. If the points are roughly uniformly distributed, the number of points nearer in X than the nearest neighbor will be on the order of the square root of N. It's less efficient if many points are within your range, but never much worse than brute force.
One advantage of this method is that's it all executable in very few memory allocations, and can mostly be done with very good memory locality, which means that it performs quite well despite the obvious limitations.
Another idea: a Delauney triangulation could work. For the Delauney triangulation, it's given that any point's nearest neighbor is an adjacent node. The intuition is that during a search, you can maintain a heap (priority queue) based on absolute distance from query point. Pick the nearest point, check that it's in range, and if so add all its neighbors. I suspect that it's impossible to miss any points like this, but you'd need to look at it more carefully to be sure...