I have a dataframe with latitude and longitude pairs.
Here is my dataframe look like.
order_lat order_long
0 19.111841 72.910729
1 19.1113
DBSCAN is meant to be used on the raw data, with a spatial index for acceleration. The only tool I know with acceleration for geo distances is ELKI (Java) - scikit-learn unfortunately only supports this for a few distances like Euclidean distance (see sklearn.neighbors.NearestNeighbors
).
But apparently, you can affort to precompute pairwise distances, so this is not (yet) an issue.
However, you did not read the documentation carefully enough, and your assumption that DBSCAN uses a distance matrix is wrong:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=2,min_samples=5)
db.fit_predict(distance_matrix)
uses Euclidean distance on the distance matrix rows, which obviously does not make any sense.
See the documentation of DBSCAN
(emphasis added):
class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', algorithm='auto', leaf_size=30, p=None, random_state=None)
metric : string, or callable
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
similar for fit_predict
:
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
In other words, you need to do
db = DBSCAN(eps=2, min_samples=5, metric="precomputed")
You can cluster spatial latitude-longitude data with scikit-learn's DBSCAN without precomputing a distance matrix.
db = DBSCAN(eps=2/6371., min_samples=5, algorithm='ball_tree', metric='haversine').fit(np.radians(coordinates))
This comes from this tutorial on clustering spatial data with scikit-learn DBSCAN. In particular, notice that the eps
value is still 2km, but it's divided by 6371 to convert it to radians. Also, notice that .fit()
takes the coordinates in radian units for the haversine metric.
@eos Gives the best answer I think - as well as making use of Haversine distance (the most relevant distance measure in this case), it avoids the need to generate a precomputed distance matrix. If you create a distance matrix then you need to calculate the pairwise distances for every combination of points (although you can obviously save a bit of time by taking advantage of the fact that your distance metric is symmetric).
If you just supply DBSCAN with a distance measure and use the ball_tree
algorithm though, it can avoid the need to calculate every possible distance. This is because the ball tree algorithm can use the triangular inequality theorem to reduce the number of candidates that need to be checked to find the nearest neighbours of a data point (this is the biggest job in DBSCAN).
The triangular inequality theorem states:
|x+y| <= |x| + |y|
...so if a point p
is distance x
from its neighbour n
, and another point q
is a distance y
from p
, if x+y
is greater than our nearest neighbour radius, we know that q
must be too far away from n
to be considered a neighbour, so we don't need to calculate its distance.
Read more about how ball trees work in the scikit-learn documentation
I don't know what implementation of haversine
you're using but it looks like it returns results in km so eps
should be 0.2, not 2 for 200 m.
For the min_samples
parameter, that depends on what your expected output is. Here are a couple of examples. My outputs are using an implementation of haversine
based on this answer which gives a distance matrix similar, but not identical to yours.
This is with db = DBSCAN(eps=0.2, min_samples=5)
[ 0 -1 -1 -1 1 1 1 -1 -1 1 1 1 2 2 1 1 1 -1 -1 -1 -1 1 -1 -1 -1 -1 -1 1 1 -1 1 1 1 1 1 2 0 -1 1 2 2 0 0 0 -1 -1 -1 1 1 1 -1 -1 1 -1 -1 1]
This creates three clusters, 0, 1
and 2
, and a lot of the samples don't fall into a cluster with at least 5 members and so are not assigned to a cluster (shown as -1
).
Trying again with a smaller min_samples
value:
db = DBSCAN(eps=0.2, min_samples=2)
[ 0 1 1 2 3 3 3 4 4 3 3 3 5 5 3 3 3 2 6 6 7 3 2 2 8 8 8 3 3 6 3 3 3 3 3 5 0 -1 3 5 5 0 0 0 6 -1 -1 3 3 3 7 -1 3 -1 -1 3]
Here most of the samples are within 200m of at least one other sample and so fall into one of eight clusters 0
to 7
.
Edited to add
It looks like @Anony-Mousse is right, though I didn't see anything wrong in my results. For the sake of contributing something, here's the code I was using to see the clusters:
from math import radians, cos, sin, asin, sqrt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd
def haversine(lonlat1, lonlat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
"""
# convert decimal degrees to radians
lat1, lon1 = lonlat1
lat2, lon2 = lonlat2
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
r = 6371 # Radius of earth in kilometers. Use 3956 for miles
return c * r
X = pd.read_csv('dbscan_test.csv')
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))
db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed') # using "precomputed" as recommended by @Anony-Mousse
y_db = db.fit_predict(distance_matrix)
X['cluster'] = y_db
plt.scatter(X['lat'], X['lng'], c=X['cluster'])
plt.show()
There are three different things you can do to use DBSCAN with GPS data. The first is that you can use the eps parameter to specify the maximum distance between data points that you will consider to create a cluster, as specified in other answers you need to take into account the scale of the distance metric you are using a pick a value that makes sense. Then you can use the min_samples this can be used as a way to filtering out data points while moving. Last the metric will allow you to use whatever distance you want.
As an example, in a particular research project I'm working on I want to extract significant locations from a subject's GPS data locations collected from their smartphone. I'm not interested on how the subject navigates through the city and also I'm more comfortable dealing with distances in meters then I can do the next:
from geopy import distance
def mydist(p1, p2):
return distance.great_circle((p1[0],p1[1],100),(p2[0],p2[1],100)).meters
DBSCAN(eps=50,min_samples=50,n_jobs=-1,metric=mydist)
Here eps as per the DBSCAN documentation "The maximum distance between two samples for one to be considered as in the neighborhood of the other." While min samples is "The number of samples (or total weight) in a neighborhood for a point to be considered as a core point." Basically with eps you control how close data points in a cluster should be, in the example above I selected 100 meters. Min samples is just a way to control for density, in the example above the data was captured at about one sample per second, because I'm not interested in when people are moving around but instead stationary locations I want to make sure I get at least the equivalent of 60 seconds of GPS data from the same location.
If this still does not make sense take a look at this DBSCAN animation.