How to use a distance formula other than Euclidean distance in k-means

主宰稳场 submitted on 2019-12-21 20:42:28

Question


I am working with latitude/longitude data. I have to form clusters based on the distance between points. The distance between two points is =ACOS(SIN(lat1)*SIN(lat2)+COS(lat1)*COS(lat2)*COS(lon2-lon1))*6371
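For reference, the spreadsheet formula above (the spherical law of cosines) might be written in R roughly as follows. This is only a sketch: the function name great_circle_km is made up here, inputs are assumed to be in decimal degrees (the formula itself needs radians), and 6371 is taken to be the Earth's mean radius in km.

```r
# Great-circle distance in km via the spherical law of cosines.
# Assumption: inputs are decimal degrees, so they are converted to radians first.
great_circle_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  lat1 <- lat1 * to_rad; lon1 <- lon1 * to_rad
  lat2 <- lat2 * to_rad; lon2 <- lon2 * to_rad
  cos_angle <- sin(lat1) * sin(lat2) +
    cos(lat1) * cos(lat2) * cos(lon2 - lon1)
  # Rounding can push the value just outside [-1, 1], where acos() returns NaN.
  cos_angle <- pmin(pmax(cos_angle, -1), 1)
  acos(cos_angle) * radius_km
}

# A quarter of the way around the equator, i.e. 6371 * pi / 2 km.
great_circle_km(0, 0, 0, 90)
```

The clamping step is a defensive choice: for nearly identical points the cosine can come out marginally above 1 in floating point.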

I want to use k-means in R. Is there any way to override the distance calculation in that process?


Answer 1:


K-means is not distance-based

It is based on variance minimization. The within-cluster sum of variances equals the sum of squared Euclidean distances from each point to its cluster mean, but the same does not hold for other distances: the mean is generally not the point that minimizes them.
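The link between the k-means objective and squared Euclidean distance can be checked numerically. The following is a small sketch on made-up random data, using only kmeans() from base R's stats package; it compares the reported tot.withinss with a hand-computed sum of squared Euclidean distances to the assigned centers.

```r
set.seed(42)
x <- matrix(rnorm(60), ncol = 2)   # 30 made-up points in 2D

fit <- kmeans(x, centers = 3)

# One row per observation: the center of the cluster it was assigned to.
assigned_centers <- fit$centers[fit$cluster, , drop = FALSE]

# Sum of squared Euclidean distances from each point to its center.
sum_sq_euclid <- sum((x - assigned_centers)^2)

# Compare with the objective value that kmeans() itself reports.
isTRUE(all.equal(sum_sq_euclid, fit$tot.withinss))
```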

If you want a k-means-like algorithm for other distances (where the mean is not an appropriate estimator), use k-medoids (PAM). In contrast to k-means, k-medoids will converge with arbitrary distance functions!
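As a rough illustration of k-medoids on a precomputed dissimilarity, here is a sketch using pam() from the cluster package (a recommended package that normally ships with R). The points are made up, and Manhattan distance stands in for "some non-Euclidean distance"; any dissimilarity object could be passed instead.

```r
library(cluster)  # provides pam()

# Made-up 2D points forming two obvious groups.
x <- rbind(c(0, 0), c(0, 1), c(1, 0),
           c(10, 10), c(10, 11), c(11, 10))

d <- dist(x, method = "manhattan")  # any dissimilarity would do here

fit <- pam(d, k = 2, diss = TRUE)

fit$id.med       # row indices of the medoids (actual observations)
fit$clustering   # cluster label for each row
```

Because each medoid is one of the input observations, no averaging is needed, which is why the choice of distance is unrestricted.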

For Manhattan distance, you can also use k-medians. The median is an appropriate estimator for the L1 norm (the median minimizes the sum of absolute differences; the mean minimizes the sum of squared differences).
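The mean-versus-median claim can be seen on a tiny made-up sample. This sketch does a brute-force search over a grid of candidate centers in base R; the grid and the sample values are arbitrary choices for illustration.

```r
x <- c(1, 2, 3, 4, 100)  # made-up sample with one outlier

sum_sq  <- function(center) sum((x - center)^2)
sum_abs <- function(center) sum(abs(x - center))

# Brute-force search over candidate centers.
candidates <- seq(0, 100, by = 0.5)
best_sq  <- candidates[which.min(sapply(candidates, sum_sq))]
best_abs <- candidates[which.min(sapply(candidates, sum_abs))]

c(mean = mean(x), best_sq = best_sq)        # both 22
c(median = median(x), best_abs = best_abs)  # both 3
```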

For your particular use case, you could also transform your data into 3D Cartesian coordinates, then use (squared) Euclidean distance and thus k-means. But your cluster centers will be somewhere underground, because the mean of points on a sphere's surface lies inside the sphere!
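A minimal sketch of that 3D approach, assuming a unit sphere and coordinates in decimal degrees. The six points are made up for illustration, and projecting the centers back to latitude/longitude is one possible way to deal with the "underground" centers, not something the answer itself prescribes.

```r
# Hypothetical points (decimal degrees), made up for illustration.
pts <- data.frame(
  lat = c(48.85, 51.51, 52.52, 40.71, 41.88, 43.65),
  lon = c(2.35, -0.13, 13.40, -74.01, -87.63, -79.38)
)

to_rad <- pi / 180
lat <- pts$lat * to_rad
lon <- pts$lon * to_rad

# Convert to Cartesian coordinates on a unit sphere.
xyz <- cbind(x = cos(lat) * cos(lon),
             y = cos(lat) * sin(lon),
             z = sin(lat))

set.seed(1)
fit <- kmeans(xyz, centers = 2, nstart = 5)

# A norm below 1 means the center lies inside the sphere ("underground").
center_norm <- sqrt(rowSums(fit$centers^2))

# Project each center back onto the surface as latitude/longitude.
center_lat <- asin(fit$centers[, "z"] / center_norm) / to_rad
center_lon <- atan2(fit$centers[, "y"], fit$centers[, "x"]) / to_rad

data.frame(center_norm, center_lat, center_lon)
```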




Answer 2:


If you have a data.frame, df, with columns for long and lat, you should be able to use the earth.dist(...) function in the fossil package to calculate a distance matrix, and pass that to pam(...) in the cluster package to do the clustering.

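library(fossil)   # earth.dist(): great-circle distances
library(cluster)  # pam(): k-medoids clustering
# Replace <longitudes> and <latitudes> with your own numeric vectors.
df    <- data.frame(long=<longitudes>, lat=<latitudes>)
d     <- earth.dist(df, dist=TRUE)   # "dist" object of pairwise distances
clust <- pam(d, k, diss=TRUE)        # k is the number of clusters you choose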

See the documentation for earth.dist(...) and pam(...).
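If the fossil package is not available, the same pipeline can be sketched with base R plus cluster, building the distance matrix directly from the formula in the question. The coordinates below are made up, decimal degrees are assumed, and k = 2 is an arbitrary choice for this toy data.

```r
library(cluster)

# Hypothetical coordinates in decimal degrees, made up for illustration.
df <- data.frame(
  long = c(2.35, -0.13, 13.40, -74.01, -87.63, -79.38),
  lat  = c(48.85, 51.51, 52.52, 40.71, 41.88, 43.65)
)

# Pairwise great-circle distances (km) using the formula from the question.
lat <- df$lat * pi / 180
lon <- df$long * pi / 180
cos_angle <- outer(sin(lat), sin(lat)) +
  outer(cos(lat), cos(lat)) * cos(outer(lon, lon, "-"))
cos_angle <- pmin(pmax(cos_angle, -1), 1)  # guard acos() against rounding
dist_km <- acos(cos_angle) * 6371
diag(dist_km) <- 0

clust <- pam(as.dist(dist_km), k = 2, diss = TRUE)
df$cluster <- clust$clustering

df[clust$id.med, ]  # the medoids are rows of df, so they lie on the surface
```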



Source: https://stackoverflow.com/questions/20655013/how-to-use-different-distance-formula-other-than-euclidean-distance-in-k-means
