Why not use just Canopy clustering instead of combining it with KMeans in Mahout?

╄→尐↘猪︶ㄣ submitted on 2019-12-11 08:33:18

Question


The question is in the title: if Canopy can be used for clustering as well as for determining centroids, why not use it for the clustering itself, instead of using it just to generate centroids as input for KMeans clustering?

I'm considering an implementation using Mahout, but I think this is more of a conceptual question than one tied to a particular system.

Thanks


Answer 1:


Canopy is deprecated in Mahout, so I wouldn't use it at all.

It is fast, so the idea was to make a quick, better-than-random estimate of the starting centroids so that kmeans would converge more quickly.

Canopy has no convergence criterion, so its first guess is all you get. Kmeans iterates, much like gradient descent (strictly speaking, it alternates between assigning points to centroids and recomputing the centroids), toward a local minimum of the defined error function. So it converges towards better guesses, but you generally start from random centroids and hope they were placed well. Canopy was an attempt to place the starting centroids better, but it did not work much better than random, if at all.
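To make the "first guess is all you get" point concrete, here is a minimal sketch of single-pass canopy center selection in plain Python. It is a simplification for illustration, not Mahout's implementation: the function name `canopy_centers`, the sample points, and the threshold value are all made up, and only the tight threshold (usually called T2) is modelled, since that is the one that decides which points may still become centers.

```python
import math

def canopy_centers(points, t2):
    """Pick canopy centers in one pass over the data.

    Hypothetical helper for illustration only. There is no error function
    and no convergence test: once a center is chosen it is never revised.
    """
    remaining = list(points)
    centers = []
    while remaining:
        # Take the next unclaimed point as a new canopy center.
        center = remaining.pop(0)
        centers.append(center)
        # Points within the tight threshold can no longer become centers.
        remaining = [p for p in remaining if math.dist(center, p) > t2]
    return centers

# Made-up sample data: two well-separated groups.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.4),
          (10.0, 10.0), (10.5, 10.0), (10.0, 9.5)]
centers = canopy_centers(points, t2=2.0)
print(centers)
```

Note that the result depends on the order of the points and on the threshold, and nothing in the loop ever moves a center after it has been picked.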

So you could just take Canopy's guess and compute clusters by going through all vectors and finding which canopy centroid each one is closest to, but the clusters would not have the benefit of iteration and would score worse on cross-validation tests.
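As a rough illustration of that difference, the sketch below compares a single nearest-centroid assignment against the same starting centroids refined by kmeans-style iterations. Everything here is an assumption for the example: the helper names, the toy data, and the starting centroids (which simply stand in for a canopy-style first guess). It measures the within-cluster sum of squared distances rather than a cross-validation score.

```python
import math

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points]

def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[l]) ** 2 for p, l in zip(points, labels))

def update(points, labels, centers):
    """Move each center to the mean of its members (unchanged if empty)."""
    new_centers = []
    for i, c in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == i]
        if members:
            new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(len(c))))
        else:
            new_centers.append(c)
    return new_centers

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until nothing changes."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = update(points, labels, centers)
        new_labels = assign(points, new_centers)
        if new_centers == centers and new_labels == labels:
            break
        centers, labels = new_centers, new_labels
    return centers, labels

# Made-up data and a made-up initial guess standing in for canopy centers.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
          (9.0, 9.0), (10.0, 10.0), (9.0, 10.0)]
initial = [(0.0, 0.0), (9.0, 9.0)]

one_pass_labels = assign(points, initial)          # canopy-style: no iteration
one_pass_sse = sse(points, initial, one_pass_labels)

final_centers, final_labels = kmeans(points, initial)
kmeans_sse = sse(points, final_centers, final_labels)

print(one_pass_sse, kmeans_sse)
```

On this toy data both approaches produce the same grouping, but the iterated centroids sit closer to their members, so the error is lower. On less cleanly separated data the iterations can also change which cluster a point belongs to.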



Source: https://stackoverflow.com/questions/25447935/why-not-use-just-canopy-clustering-instead-of-combining-with-kmeans-mahout
