Affinity Propagation preferences initialization

佛祖请我去吃肉 2021-02-20 05:18

I need to perform clustering without knowing the number of clusters in advance. The number of clusters may be anywhere from 1 to 5, since I may find cases where all the samples belong to

3 Answers
  • 2021-02-20 05:42

    You can also merge clusters afterwards, either by running the algorithm a second time on the center samples (the exemplars) or by manually merging the most similar clusters. You could merge the closest clusters iteratively until you reach the number you want. This makes the choice of preference easier, since any value that yields a reasonable number of clusters will do. (This worked decently well when I tried it.)
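The iterative merging described above could look roughly like this. It is only a sketch under stated assumptions: the toy data comes from `make_blobs`, the helper name `merge_closest_clusters` is made up for illustration, and "closest" is taken to mean the smallest squared Euclidean distance between cluster means (not between the AP exemplars themselves).

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs


def merge_closest_clusters(X, labels, max_clusters):
    """Hypothetical helper: merge the two closest clusters until at most
    max_clusters remain. Closeness is measured between cluster means."""
    labels = np.asarray(labels).copy()
    while len(np.unique(labels)) > max_clusters:
        ids = np.unique(labels)
        # Mean of every current cluster
        centers = np.array([X[labels == c].mean(axis=0) for c in ids])
        # Squared Euclidean distance between all pairs of cluster means
        d = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        np.fill_diagonal(d, np.inf)  # ignore self-distances
        i, j = np.unravel_index(np.argmin(d), d.shape)
        # Relabel cluster j as cluster i, i.e. merge the closest pair
        labels[labels == ids[j]] = ids[i]
    return labels


# Toy data (an assumption, not the asker's data)
X, _ = make_blobs(n_samples=80, centers=4, cluster_std=0.6, random_state=0)

# Any preference that gives a usable number of clusters is fine here;
# the scikit-learn default (median similarity) is used.
ap = AffinityPropagation(random_state=0).fit(X)

# Cap the result at 5 clusters, the upper bound mentioned in the question
merged = merge_closest_clusters(X, ap.labels_, max_clusters=5)
```

Merging on cluster means keeps the sketch short; merging on the exemplar samples, or re-running AP on them, would follow the same pattern.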

  • 2021-02-20 05:46

    No, there is no flaw. AP does not use distances; it requires you to specify a similarity. I don't know the scikit implementation that well, but according to what I have read, it uses negative squared Euclidean distances by default to compute the similarity matrix. If you set the input preference to the minimal Euclidean distance, you get a positive value, while all similarities are negative. This will typically result in as many clusters as you have samples (note: the higher the input preference, the more clusters).

    I would rather suggest setting the input preference to the minimal negative squared distance, i.e. -1 times the square of the largest distance in the data set. This will give you a much smaller number of clusters, but not necessarily one single cluster.

    I don't know whether the preferenceRange() function also exists in the scikit implementation. There is Matlab code for it on the AP homepage, and it is also implemented in the R package 'apcluster', which I maintain. This function determines meaningful bounds for the input preference parameter. I hope that helps.
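For scikit-learn, the suggestion above might be sketched as follows. The toy data from `make_blobs` and the helper name `count_clusters` are assumptions made for illustration. The sketch does not reproduce `preferenceRange()`; it only takes the smallest off-diagonal similarity as the preference.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import euclidean_distances

# Toy data (an assumption, not the asker's data)
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

# Similarity as used by affinity="euclidean": negative squared distance
S = -euclidean_distances(X, squared=True)

# Only the off-diagonal entries are similarities between distinct samples
off_diag = S[~np.eye(len(S), dtype=bool)]
pref_min = off_diag.min()          # -1 * (largest distance) ** 2
pref_median = np.median(off_diag)  # the scikit-learn default preference


def count_clusters(preference):
    """Hypothetical helper: fit AP and return the number of exemplars found.
    Returns 0 if the run does not converge."""
    ap = AffinityPropagation(
        affinity="euclidean", preference=preference, random_state=0
    ).fit(X)
    return len(ap.cluster_centers_indices_)


n_low = count_clusters(pref_min)        # few clusters, not necessarily one
n_default = count_clusters(pref_median)
print(f"preference={pref_min:.2f} -> {n_low} clusters")
print(f"preference={pref_median:.2f} -> {n_default} clusters")
```

Both preferences are negative because they live on the same scale as the similarities, which is what makes them comparable to the entries of `S`.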

  • 2021-02-20 05:48

    You can control it by specifying the minimum preference, but there is no guarantee that you will find a single cluster.

    I would also advise against forcing a single cluster, because it can lead to wrong assignments: some samples may not really be similar to the exemplar, but if you supply the minimum preference, AP will group them with it anyway.
