Scikit Learn - K-Means - Elbow criterion

耶瑟儿~ 2021-01-30 02:40

Today I'm trying to learn something about k-means. I understand the algorithm and I know how it works. Now I'm looking for the right k... I found the elbow criterion as a method to determine the right k.

3 Answers
  •  长发绾君心
    2021-01-30 03:00

    The elbow criterion is a visual method. I have not yet seen a robust mathematical definition of it. But k-means is a pretty crude heuristic, too.

    So yes, you will need to run k-means with k=1...kmax, then plot the resulting SSQ and decide upon an "optimal" k.
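    A minimal sketch of that loop with scikit-learn (the make_blobs toy data and the range k = 1..10 are illustrative placeholders, not part of the original answer): fit KMeans for each k and plot the inertia_ attribute, which is the SSQ.

        import matplotlib.pyplot as plt
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs

        # Placeholder data; substitute your own feature matrix X.
        X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

        ks = range(1, 11)
        ssq = []
        for k in ks:
            km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
            ssq.append(km.inertia_)  # within-cluster sum of squares (SSQ)

        plt.plot(ks, ssq, marker="o")
        plt.xlabel("k")
        plt.ylabel("SSQ (inertia)")
        plt.show()

    You then pick the k where the curve visibly bends; as the answer says, that judgment is visual rather than mathematically defined.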

    There exist advanced versions of k-means such as X-means that will start with k=2 and then increase it until a secondary criterion (AIC/BIC) no longer improves. Bisecting k-means is an approach that also starts with k=2 and then repeatedly splits clusters until k=kmax. You could probably extract the interim SSQs from it.
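    X-means is not in scikit-learn, but bisecting k-means is, as sklearn.cluster.BisectingKMeans (available since scikit-learn 1.1). The library does not expose the interim SSQs of a single run, so this sketch, under that assumption, simply refits once per k:

        from sklearn.cluster import BisectingKMeans
        from sklearn.datasets import make_blobs

        X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

        # One fit per k; each run starts from one cluster and repeatedly
        # splits until n_clusters is reached.
        for k in range(2, 8):
            bkm = BisectingKMeans(n_clusters=k, random_state=42).fit(X)
            print(k, bkm.inertia_)  # SSQ for this k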

    Either way, I have the impression that in any actual use case where k-means is really good, you already know the k you need beforehand. In those cases, k-means is not so much a "clustering" algorithm as a vector quantization algorithm, e.g. reducing the number of colors of an image to k (where you would often choose k=32, because that is 5 bits of color depth and can be stored in a bit-packed way). Or in bag-of-visual-words approaches, where you choose the vocabulary size manually; a popular value seems to be k=1000. You then don't really care much about the quality of the "clusters"; the main point is to reduce an image to a 1000-dimensional sparse vector, and the performance of a 900-dimensional or an 1100-dimensional representation will not be substantially different.
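    To make the color-reduction case concrete, here is a hedged sketch using scikit-learn's bundled sample image; the image china.jpg, k=32, and the 10,000-pixel training subsample are illustrative choices, not from the original answer:

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.datasets import load_sample_image

        img = load_sample_image("china.jpg") / 255.0  # (h, w, 3), floats in [0, 1]
        pixels = img.reshape(-1, 3)

        # Fit the 32-color "codebook" on a subsample for speed, then
        # replace every pixel by its nearest centroid color.
        rng = np.random.default_rng(0)
        sample = pixels[rng.choice(len(pixels), 10_000, replace=False)]
        km = KMeans(n_clusters=32, n_init=4, random_state=0).fit(sample)
        labels = km.predict(pixels)
        quantized = km.cluster_centers_[labels].reshape(img.shape)
        print("distinct colors after quantization:", len(np.unique(labels)))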

    For actual clustering tasks, i.e. when you want to analyze the resulting clusters manually, people usually use more advanced methods than k-means. K-means is more of a data simplification technique.
