I am reading about the difference between k-means clustering and k-medoids clustering.
Supposedly there is an advantage to using the pairwise distance measure in the k-medoid algorithm, instead of the more familiar squared Euclidean distance used by k-means. Why is that an advantage?
First of all, you can use k-medoids with any similarity or dissimilarity measure. K-means, however, may fail to converge; it really must only be used with distances that are consistent with the mean. So, for example, absolute Pearson correlation must not be used with k-means, but it works well with k-medoids.
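To see why any measure works, here is a minimal sketch (the function names and the choice of 1 - |Pearson correlation| as the dissimilarity are mine, for illustration): the medoid step only ever evaluates the measure pairwise, so no "mean under this measure" needs to exist.

```python
import numpy as np

def abs_pearson_dissimilarity(x, y):
    # 1 - |r|: treats strongly correlated AND anti-correlated series as close.
    return 1.0 - abs(np.corrcoef(x, y)[0, 1])

def medoid(points, dissimilarity):
    # The medoid is the actual data point minimizing the summed
    # dissimilarity to all other points -- no mean is ever computed.
    costs = [sum(dissimilarity(p, q) for q in points) for p in points]
    return int(np.argmin(costs))

series = [np.array([1.0, 2.0, 3.0, 4.0]),
          np.array([1.0, 2.5, 2.0, 4.5]),
          np.array([5.0, 1.0, 4.0, 2.0])]
print(medoid(series, abs_pearson_dissimilarity))  # index of the most central series
```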
Secondly, the medoid as used by k-medoids is roughly comparable to the median (in fact, there is also k-medians, which is like k-means but for Manhattan distance). If you look up the literature on the median, you will find plenty of explanations and examples of why the median is more robust to outliers than the arithmetic mean. Essentially, these explanations and examples also hold for the medoid: it is a more robust estimate of a representative point than the mean as used in k-means.
Consider this 1-dimensional example:
[1, 2, 3, 4, 100000]
Both the median and medoid of this set are 3. The mean is 20002.
Which do you think is more representative of the data set? The mean has the lower squared error, but assuming that there might be a measurement error in this data set ...
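A quick check of those numbers (the medoid here is computed under plain absolute distance, which is what 1-dimensional Euclidean distance reduces to):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 100000], dtype=float)

print(np.mean(data))    # 20002.0
print(np.median(data))  # 3.0

# The medoid: the member of the set with the smallest summed
# distance to all other members (plain |x - y| distance here).
costs = [np.abs(data - x).sum() for x in data]
print(data[int(np.argmin(costs))])  # 3.0
```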
Technically, the notion of a breakdown point is used in statistics. The median has a breakdown point of 50% (i.e. up to half of the data points can be corrupted without making the estimate arbitrarily bad), whereas the mean has a breakdown point of 0 (i.e. a single extreme observation can make the estimate arbitrarily bad).
I do not have a proof, but I assume the medoid will have a breakdown point similar to that of the median.
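As a rough illustration of that claim (not a proof): corrupting 40% of a small sample leaves the median untouched, while the mean explodes.

```python
import numpy as np

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bad = np.array([1.0, 2.0, 3.0, 1e9, 1e9])  # 2 of 5 values corrupted (40%)

print(np.median(clean), np.median(bad))  # 3.0 3.0  -- unchanged
print(np.mean(clean), np.mean(bad))      # 3.0 400000001.2
```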
The main drawback of k-medoids is runtime. Usually, PAM takes much longer to run than k-means: as it involves computing all pairwise distances, it is in O(n^2 k i), whereas k-means runs in O(n k i), where usually k times the number of iterations satisfies k*i << n.
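To make that cost visible, here is a sketch of a naive alternating k-medoids variant (simpler than PAM's swap-based search, so take it as an assumption-laden illustration, not PAM itself). The (n, n) distance matrix it consumes is exactly the O(n^2) part that k-means never has to build.

```python
import numpy as np

def k_medoids(D, k, iters=100, seed=0):
    # D is the full (n, n) pairwise distance matrix -- computing and
    # scanning it is what makes k-medoids costlier than k-means.
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(iters):
        # Assign each point to its nearest current medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Re-pick each cluster's medoid: the member with the smallest
        # summed distance to the rest of its cluster.
        new = np.array([m[np.argmin(D[np.ix_(m, m)].sum(axis=1))]
                        for m in (np.where(labels == c)[0] for c in range(k))])
        if np.array_equal(new, medoids):
            break
        medoids = new
    return medoids

X = np.array([[1.0], [2.0], [3.0], [4.0], [100000.0]])
D = np.abs(X - X.T)          # the O(n^2) pairwise distance matrix
print(k_medoids(D, k=2))     # indices of the chosen medoids
```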