How to do clustering using the matrix of correlation coefficients?

鱼传尺愫 2021-02-06 07:40

I have a correlation coefficient matrix (n*n). How can I do clustering using this matrix?

Can I use the linkage and fcluster functions in SciPy?

1 answer
  • 2021-02-06 08:09

    Clustering data using a correlation matrix is a reasonable idea, but the correlations have to be pre-processed first. To begin with, the correlation matrix returned by numpy.corrcoef is affected by floating-point rounding errors:

    1. It is not always exactly symmetric.
    2. Diagonal terms are not always exactly 1.

    Both can be fixed by averaging the matrix with its transpose and filling the diagonal with 1:

    import numpy as np
    data = np.random.randint(0, 10, size=(20, 10))   # 20 variables with 10 observations each
    corr = np.corrcoef(data)                         # 20 by 20 correlation matrix
    corr = (corr + corr.T)/2                         # made symmetric
    np.fill_diagonal(corr, 1)                        # put 1 on the diagonal
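
    As a quick sanity check, the clean-up can be verified before moving on. This is only a sketch: the seeded generator and the names is_symmetric and has_unit_diagonal are my own additions for illustration, not part of the example above.

```python
import numpy as np

# Seeded generator (an assumption, used only to make the check reproducible)
rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(20, 10))   # 20 variables, 10 observations each

corr = np.corrcoef(data)
corr = (corr + corr.T) / 2                  # enforce exact symmetry
np.fill_diagonal(corr, 1)                   # enforce an exact unit diagonal

# Exact comparisons are intended here: the point is to remove rounding noise
is_symmetric = np.array_equal(corr, corr.T)
has_unit_diagonal = bool(np.all(np.diag(corr) == 1))
print(is_symmetric, has_unit_diagonal)
```

    Note that np.corrcoef returns NaN for any variable that is constant across observations, so with real data it may be worth checking np.isnan(corr).any() as well.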
    

    Second, the input to a distance-based clustering method, such as linkage, must measure the dissimilarity of objects, whereas correlation measures similarity. So the correlation has to be transformed so that a correlation of 1 maps to 0, while a correlation of 0 maps to a larger value.
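
    To make the requirement concrete, here is a small sketch of two possible mappings applied to a few hand-picked correlation values. The names corr_values, signed and unsigned are mine, chosen only for illustration. The two mappings differ in how they treat negative correlation.

```python
import numpy as np

# A few hand-picked correlation values, from perfectly positive to perfectly negative
corr_values = np.array([1.0, 0.5, 0.0, -0.5, -1.0])

# Option 1: negative correlation counts as "far apart"
signed = 1 - corr_values

# Option 2: strong negative correlation counts as "close", like positive correlation
unsigned = 1 - np.abs(corr_values)

print(signed)
print(unsigned)
```

    Both map a correlation of 1 to 0. Which one is appropriate depends on whether anti-correlated variables should end up in the same cluster.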

    This blog post discusses several such transformations, and recommends dissimilarity = 1 - abs(correlation). The idea is that a strong negative correlation also indicates that the objects are related, just as a positive correlation does. Here is the continuation of the example:

    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    
    dissimilarity = 1 - np.abs(corr)
    hierarchy = linkage(squareform(dissimilarity), method='average')
    labels = fcluster(hierarchy, 0.5, criterion='distance')
    

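    Note that we don't feed the full square distance matrix into linkage; it needs to be converted to condensed form with squareform first. A 2-D array passed to linkage would be interpreted as a set of observation vectors rather than as distances.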
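
    To see what squareform actually does, here is a toy 3 by 3 example with made-up numbers. The condensed form keeps only the entries above the diagonal, row by row, so an n by n matrix becomes a vector of length n*(n-1)/2. With its default checks, squareform also rejects a matrix that is not symmetric or has a non-zero diagonal, which is one reason the clean-up of the correlation matrix matters.

```python
import numpy as np
from scipy.spatial.distance import squareform

# Made-up symmetric dissimilarity matrix with a zero diagonal
dissimilarity = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.7],
    [0.9, 0.7, 0.0],
])

condensed = squareform(dissimilarity)   # upper triangle as a flat vector
restored = squareform(condensed)        # back to the full square matrix

print(condensed)
print(restored.shape)
```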

    Which clustering method and which threshold to use depends on the context of your problem; there are no universal rules. Often, 0.5 is a reasonable threshold to use for correlation, so I used it here. With my 20 sets of random numbers I ended up with 7 clusters, encoded in labels as

    [7, 7, 7, 1, 4, 4, 2, 7, 5, 7, 2, 5, 6, 3, 6, 1, 5, 1, 4, 2] 
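
    If the flat label array is hard to read, it can be turned into a mapping from cluster label to member indices. The sketch below also shows criterion='maxclust', which asks fcluster for at most a given number of clusters instead of cutting at a distance threshold. The seeded generator, the clusters dictionary and the choice of 4 clusters are my own assumptions for illustration, so the result will not match the labels above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Same pipeline as above, with a seeded generator (an assumption) for reproducibility
rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(20, 10))
corr = np.corrcoef(data)
corr = (corr + corr.T) / 2
np.fill_diagonal(corr, 1)

dissimilarity = 1 - np.abs(corr)
hierarchy = linkage(squareform(dissimilarity), method='average')

# Cut the tree at a distance threshold, then group variable indices by label
labels = fcluster(hierarchy, 0.5, criterion='distance')
clusters = {}
for index, label in enumerate(labels):
    clusters.setdefault(int(label), []).append(index)

# Alternative: ask for at most 4 clusters instead of fixing a threshold
labels_k = fcluster(hierarchy, 4, criterion='maxclust')

print(clusters)
print(labels_k)
```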
    