Question
I am working with latitude/longitude data. I have to make clusters based on the distance between points. The distance between two points is =ACOS(SIN(lat1)*SIN(lat2)+COS(lat1)*COS(lat2)*COS(lon2-lon1))*6371
I want to use k-means in R. Is there any way I can override the distance calculation in that process?
Answer 1:
K-means is not distance based
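It is based on variance minimization. The sum of within-cluster variances equals the sum of squared Euclidean distances to the cluster means, but no such equivalence holds for other distances, so k-means with a different distance plugged in is no longer guaranteed to converge.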
If you want a k-means-like algorithm for other distances (where the mean is not an appropriate estimator), use k-medoids (PAM). In contrast to k-means, k-medoids will converge with arbitrary distance functions!
For Manhattan distance, you can also use k-medians. The median is an appropriate estimator for the L1 norm (the median minimizes the sum of absolute differences; the mean minimizes the sum of squared distances).
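The estimator claim above can be checked numerically. This is only a small sketch on made-up one-dimensional data (the vector `x` and the helper names are hypothetical): it scans a grid of candidate centers and reports which one minimizes each cost.

```r
# Hypothetical 1-D sample with one outlier.
x <- c(1, 2, 3, 4, 100)

# Cost of a candidate center under the L1 and squared-L2 criteria.
sum_abs <- function(center) sum(abs(x - center))
sum_sq  <- function(center) sum((x - center)^2)

# Scan a grid of candidate centers and keep the best one for each cost.
candidates <- seq(0, 100, by = 0.5)
best_l1 <- candidates[which.min(vapply(candidates, sum_abs, numeric(1)))]
best_l2 <- candidates[which.min(vapply(candidates, sum_sq, numeric(1)))]

best_l1  # 3, which equals median(x)
best_l2  # 22, which equals mean(x)
```

The outlier drags the mean far away from most of the data, while the median stays put, which is why the two criteria call for different center estimators.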
For your particular use case, you could also transform your data into 3D space (Cartesian coordinates on a sphere), then use (squared) Euclidean distance and thus k-means. But your cluster centers will be somewhere underground, because the mean of points on a sphere lies inside it!
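A minimal sketch of that 3D approach, assuming coordinates in decimal degrees and a unit sphere. The helper `latlon_to_xyz` and the sample points are made up for illustration; `kmeans` is the standard function from the stats package. Projecting the centers back to the surface is one possible way to deal with the "underground" issue, not something the answer prescribes.

```r
# Convert latitude/longitude in degrees to Cartesian coordinates on a unit sphere.
latlon_to_xyz <- function(lat_deg, lon_deg) {
  lat <- lat_deg * pi / 180
  lon <- lon_deg * pi / 180
  cbind(x = cos(lat) * cos(lon),
        y = cos(lat) * sin(lon),
        z = sin(lat))
}

# Hypothetical sample data: three points near Paris, three near New York.
pts <- data.frame(
  lat = c(48.85, 48.86, 48.80, 40.71, 40.73, 40.65),
  lon = c(2.35, 2.29, 2.40, -74.00, -73.99, -74.10)
)

xyz <- latlon_to_xyz(pts$lat, pts$lon)

set.seed(1)
fit <- kmeans(xyz, centers = 2, nstart = 10)

# The centers are means of points on the sphere, so they lie inside it.
# Normalizing them projects each center back onto the surface.
centers <- fit$centers
norms <- sqrt(rowSums(centers^2))
center_lat <- asin(centers[, "z"] / norms) * 180 / pi
center_lon <- atan2(centers[, "y"], centers[, "x"]) * 180 / pi
```

Note that this clusters by straight-line (chord) distance through the sphere rather than by great-circle distance. The two rank pairs of points the same way, but they are not identical quantities.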
Answer 2:
If you have a data.frame, df, with columns for lat and long, then you should be able to use the earth.dist(...) function in the fossil package to calculate a distance matrix, and pass that to pam(...) in the cluster package to do the clustering.
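library(fossil)   # earth.dist()
library(cluster)  # pam()
# earth.dist() expects longitude in the first column and latitude in the second
df <- data.frame(long=<longitudes>, lat=<latitudes>)
dist <- earth.dist(df, dist=TRUE)  # pairwise great-circle distances as a dist object
clust <- pam(dist, k, diss=TRUE)   # k is the desired number of clusters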
See the documentation for earth.dist(...) and pam(...).
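If fossil is not available, the distance matrix can also be built by hand. The following is only a sketch under assumptions: `great_circle_km` is a hypothetical helper that implements the formula from the question (with degrees converted to radians), the sample points are made up, and the cluster package is assumed to be installed.

```r
library(cluster)  # provides pam()

# Great-circle distance in km via the spherical law of cosines,
# the same formula quoted in the question. Inputs are in degrees.
great_circle_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  lat1 <- lat1 * to_rad; lon1 <- lon1 * to_rad
  lat2 <- lat2 * to_rad; lon2 <- lon2 * to_rad
  cos_angle <- sin(lat1) * sin(lat2) +
    cos(lat1) * cos(lat2) * cos(lon2 - lon1)
  # Clamp to [-1, 1] so rounding error cannot make acos() return NaN.
  acos(pmin(1, pmax(-1, cos_angle))) * radius_km
}

# Hypothetical sample data: three points near Paris, three near New York.
pts <- data.frame(
  lat = c(48.85, 48.86, 48.80, 40.71, 40.73, 40.65),
  lon = c(2.35, 2.29, 2.40, -74.00, -73.99, -74.10)
)

# Full pairwise distance matrix, then converted to a dist object.
idx <- seq_len(nrow(pts))
d <- outer(idx, idx, function(i, j) {
  great_circle_km(pts$lat[i], pts$lon[i], pts$lat[j], pts$lon[j])
})

fit <- pam(as.dist(d), k = 2, diss = TRUE)
fit$clustering  # cluster label for each row of pts
fit$id.med      # row indices of the medoids
```

Unlike k-means centers, the medoids are actual rows of the data, so every cluster center is a real location on the surface.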
Source: https://stackoverflow.com/questions/20655013/how-to-use-different-distance-formula-other-than-euclidean-distance-in-k-means