How to compare clusters?

既然无缘 2021-01-18 13:22

Hopefully this can be done with Python! I ran two clustering programs on the same data and now have a cluster file from each. I reformatted the files so that they look like

5 answers
  •  一向 (original poster)
    2021-01-18 13:26

    After learning so much from Stack Overflow, I finally have an opportunity to give back! A different approach from those offered so far is to relabel the clusters to maximize alignment, after which comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you want these two labelings to be treated as equivalent, since L1 and L2 segment the items into clusters identically and differ only in the label names. This approach relabels L2 to maximize alignment and, in the example above, results in L2==L1.
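To make the relabeling idea concrete, here is a minimal sketch that uses only the standard library. It is a brute-force stand-in, not the greedy method implemented in this answer: it tries every permutation of the label names and keeps the one with the most matching positions, so it is only practical for a handful of clusters. The name `relabel_brute_force` is hypothetical.

```python
from itertools import permutations


def relabel_brute_force(labels1, labels2):
    """Relabel labels2 so it agrees with labels1 in as many positions as possible.

    Hypothetical helper for illustration only; cost grows factorially
    with the number of distinct labels.
    """
    names = sorted(set(labels1) | set(labels2))
    best_mapping, best_hits = None, -1
    for perm in permutations(names):
        mapping = dict(zip(names, perm))  # old label in labels2 -> new label
        hits = sum(a == mapping[b] for a, b in zip(labels1, labels2))
        if hits > best_hits:
            best_mapping, best_hits = mapping, hits
    return [best_mapping[b] for b in labels2]


L1 = [0, 0, 1, 1, 2, 2]
L2 = [2, 2, 0, 0, 1, 1]
print(relabel_brute_force(L1, L2))  # [0, 0, 1, 1, 2, 2]
```
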

    I found a solution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012." and below is an implementation in Python using NumPy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done:

    import numpy as np

    def alignClusters(clstr1, clstr2):
        """Given 2 cluster assignments, this function will rename the second to
           maximize alignment of elements within each cluster. This method is
           described in Menéndez, Héctor D. A genetic approach to the graph and
           spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
           are consecutive integers starting with zero)

           INPUTS:
           clstr1 - The first clustering assignment
           clstr2 - The second clustering assignment

           OUTPUTS:
           clstr2_temp - The second clustering assignment with clusters renumbered to
           maximize alignment with the first clustering assignment """
        # Accept plain lists as well as arrays; elementwise == needs arrays
        clstr1 = np.asarray(clstr1)
        clstr2 = np.asarray(clstr2)
        K = np.max(clstr1) + 1
        simdist = np.zeros((K, K))

        # simdist[i,j]: overlap of cluster i (in clstr1) and cluster j (in clstr2),
        # averaged as a fraction of each cluster's size
        for i in range(K):
            for j in range(K):
                dcix = clstr1 == i
                dcjx = clstr2 == j
                dd = np.dot(dcix.astype(int), dcjx.astype(int))
                simdist[i, j] = (dd / np.sum(dcix) + dd / np.sum(dcjx)) / 2

        clstr2_temp = np.copy(clstr2)
        for _ in range(K):
            # Greedily pick the most similar remaining pair:
            # x is a label in clstr1, y is a label in clstr2
            I = np.argmax(simdist.T)
            x, y = np.unravel_index(I, simdist.shape, order='F')
            # Rename cluster y of the original clstr2 to x
            clstr2_temp[clstr2 == y] = x
            # Exclude the matched row and column; -1 (rather than 0) keeps
            # them from being picked again when the remaining overlaps are all 0
            simdist[x, :] = -1
            simdist[:, y] = -1
        return clstr2_temp
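Once the second labeling has been relabeled, the comparison itself can be as simple as counting the positions where the two labelings agree. The following is a minimal sketch, assuming both labelings are equal-length sequences that already share label names; `agreement` is a hypothetical helper and not part of the function above.

```python
def agreement(labels1, labels2_aligned):
    """Return (fraction of matching positions, indices where the labels differ)."""
    n = len(labels1)
    if n == 0 or n != len(labels2_aligned):
        raise ValueError("both labelings must be non-empty and cover the same items")
    mismatches = [
        k for k, (a, b) in enumerate(zip(labels1, labels2_aligned)) if a != b
    ]
    return 1 - len(mismatches) / n, mismatches


# Item 3 is the only one the two labelings disagree on: 5 of 6 positions match
score, differing = agreement([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
print(score, differing)
```

The list of differing indices is often more useful than the score alone, since it points at the items the two clustering programs actually disagree about.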
    
