Fast (< n^2) clustering algorithm

情话喂你 2021-01-30 00:27

I have 1 million 5-dimensional points that I need to group into k clusters with k << 1 million. In each cluster, no two points should be too far apart (e.g. they could be

6 Answers
  •  一向 (OP)
    2021-01-30 00:52

    I have a Perl module that does exactly what you want: Algorithm::ClusterPoints.

    First, it uses the algorithm you have described in your post to divide the points into multidimensional sectors, and then it uses brute force to find clusters between points in adjacent sectors.

    The complexity varies from O(N) to O(N**2) in very degenerate cases.
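
    The two steps above (bucket the points into sectors, then brute-force only between nearby sectors) can be sketched roughly as follows. This is a minimal Python illustration, not the code of Algorithm::ClusterPoints: the function name `cluster_points`, the union-find bookkeeping and the choice of how many neighboring sectors to scan are all assumptions made for the example. It links any two points whose distance is at most c.

```python
from collections import defaultdict
from itertools import product
import math


def cluster_points(points, c):
    """Return clusters (lists of point indices); points within distance c are linked."""
    if not points:
        return []
    d = len(points[0])
    s = c / math.sqrt(d)  # sector edge chosen so that the sector diagonal equals c

    # Step 1: bucket point indices by the integer coordinates of their sector.
    sectors = defaultdict(list)
    for i, p in enumerate(points):
        sectors[tuple(math.floor(x / s) for x in p)].append(i)

    # Union-find over point indices (hypothetical helper, not from the module).
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Two points within distance c differ by at most ceil(sqrt(d)) sectors per axis.
    reach = math.ceil(math.sqrt(d))
    offsets = list(product(range(-reach, reach + 1), repeat=d))

    # Step 2: brute force, but only between points in nearby sectors.
    for key, members in sectors.items():
        for off in offsets:
            other = sectors.get(tuple(k + o for k, o in zip(key, off)))
            if other is None:
                continue
            for i in members:
                for j in other:
                    if i < j and math.dist(points[i], points[j]) <= c:
                        parent[find(j)] = find(i)

    clusters = defaultdict(list)
    for i in range(len(points)):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


points = [(0.0, 0.0), (0.5, 0.0), (1.2, 0.0), (5.0, 5.0), (5.3, 5.1)]
print(cluster_points(points, c=1.0))  # [[0, 1, 2], [3, 4]]
```

    When the points are spread over many sectors, each sector holds few points and the inner loops stay small; when everything falls into a handful of sectors, the sketch degrades to comparing all pairs.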

    update:

    @Denis: no, it is much worse:

    For d dimensions, the sector (or little hypercube) size s is determined so that its diagonal l is the minimum distance c allowed between two points in different clusters.

    l = c
    l = sqrt(d * s * s)
    s = sqrt(c * c / d) = c / sqrt(d)
    

    Then you have to consider all the sectors that touch the hypersphere with diameter r = 2c + l centered on the pivot sector.

    Roughly, since r is a diameter, we have to consider ceil(r / (2 * s)) rows of sectors in every direction, and that means n = pow(2 * ceil(r / (2 * s)) + 1, d).

    For instance, for d=5 and c=1 we get l=1, s=0.447, r=3 and n=pow(9, 5)=59049
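
    The arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `sector_geometry` is made up for the example, and it assumes l = c as defined above and that r is a diameter, so each direction spans r / 2.

```python
import math


def sector_geometry(d, c):
    """Sector edge, sphere diameter, rows per direction and hypercube sector count."""
    l = c                          # sector diagonal equals the linking distance
    s = l / math.sqrt(d)           # from l = sqrt(d * s * s)
    r = 2 * c + l                  # diameter of the hypersphere around the pivot sector
    rows = math.ceil((r / 2) / s)  # assumption: r is a diameter, so the reach is r / 2
    n = (2 * rows + 1) ** d        # sectors in the bounding hypercube
    return s, r, rows, n


s, r, rows, n = sector_geometry(d=5, c=1.0)
print(round(s, 3), r, rows, n)  # 0.447 3.0 4 59049
```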

    Actually, we have to check fewer neighbor sectors: n counts every sector in the bounding hypercube, whereas we only need to check those touching the hypersphere inscribed in it.

    Considering the symmetric nature of the "are in the same cluster" relation, we can also halve the number of sectors that have to be checked.
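
    Both reductions can be illustrated by enumerating sector offsets in Python. The pruning test below is an assumption chosen for the sketch (keep an offset only if the gap between the two sectors can be at most c), so the counts it produces need not equal what the module actually checks. The helper names are hypothetical.

```python
from itertools import product


def neighbor_offsets(d, reach):
    """Every integer sector offset within `reach` rows in each direction."""
    return list(product(range(-reach, reach + 1), repeat=d))


def may_touch(offset):
    # Assumed criterion: along one axis the gap between two sectors is
    # max(|k| - 1, 0) sector edges. With s = c / sqrt(d), "gap <= c" becomes
    # sum(gap ** 2) <= d, which keeps the test in exact integer arithmetic.
    d = len(offset)
    return sum(max(abs(k) - 1, 0) ** 2 for k in offset) <= d


def keep_half(offsets):
    # The relation holds in both directions, so of each pair (k, -k) only the
    # lexicographically non-negative offset is kept, plus the zero offset itself.
    return [k for k in offsets if k >= (0,) * len(k)]


all_offsets = neighbor_offsets(d=5, reach=4)
pruned = [k for k in all_offsets if may_touch(k)]
print(len(all_offsets), len(pruned), len(keep_half(pruned)))
```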

    Specifically, Algorithm::ClusterPoints for the case where d=5 checks 3903 sectors.
