How to compare clusters?

前端 未结 5 780
既然无缘
既然无缘 2021-01-18 13:22

Hopefully this can be done with python! I used two clustering programs on the same data and now have a cluster file from both. I reformatted the files so that they look like

相关标签:
5条回答
  • 2021-01-18 13:26

    After learning so much from Stackoverflow, finally I have an opportunity to give back! A different approach from those offered so far is to relabel clusters to maximize alignment, and then comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you want these two labelings to be equivalent since L1 and L2 are essentially segmenting items into clusters identically. This approach relabels L2 to maximize alignment, and in the example above, will result in L2==L1.

    I found a soution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012." and below is an implementation in Python using numpy. I'm relatively new to Python, so there may be better implementations, but I think this gets the job done:

    def alignClusters(clstr1,clstr2):
    """Given 2 cluster assignments, this funciton will rename the second to 
       maximize alignment of elements within each cluster. This method is 
       described in in Menéndez, Héctor D. A genetic approach to the graph and 
       spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
       are consecutive integers starting with zero)
    
       INPUTS:
       clstr1 - The first clustering assignment
       clstr2 - The second clustering assignment
    
       OUTPUTS:
       clstr2_temp - The second clustering assignment with clusters renumbered to
       maximize alignment with the first clustering assignment """
    K = np.max(clstr1)+1
    simdist = np.zeros((K,K))
    
    for i in range(K):
        for j in range(K):
            dcix = clstr1==i
            dcjx = clstr2==j
            dd = np.dot(dcix.astype(int),dcjx.astype(int))
            simdist[i,j] = (dd/np.sum(dcix!=0) + dd/np.sum(dcjx!=0))/2
    mask = np.zeros((K,K))
    for i in range(K):
        simdist_vec = np.reshape(simdist.T,(K**2,1))
        I = np.argmax(simdist_vec)
        xy = np.unravel_index(I,simdist.shape,order='F')
        x = xy[0]
        y = xy[1]
        mask[x,y] = 1
        simdist[x,:] = 0
        simdist[:,y] = 0
    swapIJ = np.unravel_index(np.where(mask.T),simdist.shape,order='F')
    swapI = swapIJ[0][1,:]
    swapJ = swapIJ[0][0,:]
    clstr2_temp = np.copy(clstr2)
    for k in range(swapI.shape[0]):
        swapj = [swapJ[k]==i for i in clstr2]
        clstr2_temp[swapj] = swapI[k]
    return clstr2_temp
    
    0 讨论(0)
  • 2021-01-18 13:33

    You have to write some code to parse the file. If you ignore the cluster, you should be able to distinguish between family, genera and species based on indentation.

    The easiest way it to define a named tuple:

    import collections
    Bacterium = collections.namedtuple('Bacterium', ['family', 'genera', 'species'])
    

    You can make in instance of this object like this:

    b = Bacterium('Brucellaceae', 'Brucella', 'canis')
    

    Your parser should read a file line by line, and set the family and genera. If it then finds a species, it should add a Bacterium to a list;

    with open('cluster0.txt', 'r') as infile:
        lines = infile.readlines()
    family = None
    genera = None
    bacteria = []
    for line in lines:
        # set family and genera.
        # if you detect a bacterium:
        bacteria.append(Bacterium(family, genera, species))
    

    Once you have a list of all bacteria in each file or cluster, you can select from all the bacteria like this:

    s = [b for b in bacteria if b.genera == 'Streptomycetaceae']
    
    0 讨论(0)
  • 2021-01-18 13:36

    Given:

    file1 = '''Cluster 0:
     giant(2)
      red(2)
       brick(1)
       apple(1)
    Cluster 1:
     tiny(3)
      green(1)
       dot(1)
      blue(2)
       flower(1)
       candy(1)'''.split('\n')
    file2 = '''Cluster 18:
     giant(2)
      red(2)
       brick(1)
       tomato(1)
    Cluster 19:
     tiny(2)
      blue(2)
       flower(1)
       candy(1)'''.split('\n')
    

    Is this what you need?

    def parse_file(open_file):
        result = []
    
        for line in open_file:
            indent_level = len(line) - len(line.lstrip())
            if indent_level == 0:
                levels = ['','','']
            item = line.lstrip().split('(', 1)[0]
            levels[indent_level - 1] = item
            if indent_level == 3:
                result.append('.'.join(levels))
        return result
    
    data1 = set(parse_file(file1))
    data2 = set(parse_file(file2))
    
    differences = [
        ('common elements', data1 & data2),
        ('missing from file2', data1 - data2),
        ('missing from file1', data2 - data1) ]
    

    To see the differences:

    for desc, items in differences:
        print desc
        print 
        for item in items:
            print '\t' + item
        print
    

    prints

    common elements
    
        giant.red.brick
        tiny.blue.candy
        tiny.blue.flower
    
    missing from file2
    
        tiny.green.dot
        giant.red.apple
    
    missing from file1
    
        giant.red.tomato
    
    0 讨论(0)
  • 2021-01-18 13:37

    So just for help, as I see lots of different answers in the comment, I'll give you a very, very simple implementation of a script that you can start from.

    Note that this does not answer your full question but points you in one of the directions in the comments.

    Normally if you have no experience I'd argue to go a head and read up on Python (which i'll do anyways, and i'll throw in a few links in the bottom of the answer)

    On to the fun stuffs! :)

    class Cluster(object):
      '''
      This is a class that will contain your information about the Clusters.
      '''
      def __init__(self, number):
        '''
        This is what some languages call a constructor, but it's not.
        This method initializes the properties with values from the method call.
        '''
        self.cluster_number = number
        self.family_name = None
        self.bacteria_name = None
        self.bacteria = []
    
    #This part below isn't a part of the class, this is the actual script.
    with open('bacteria.txt', 'r') as file:
      cluster = None
      clusters = []
      for index, line in enumerate(file):
        if line.startswith('Cluster'):
          cluster = Cluster(index)
          clusters.append(cluster)
        else:
          if not cluster.family_name:
            cluster.family_name = line
          elif not cluster.bacteria_name:
            cluster.bacteria_name = line
          else:
            cluster.bacteria.append(line)
    

    I wrote this as dumb and overly simple as I could without any fancy stuff and for Python 2.7.2 You could copy this file into a .py file and run it directly from command line python bacteria.py for example.

    Hope this helps a bit and don't hesitate to come by our Python chat room if you have any questions! :)

    • http://learnpythonthehardway.org/
    • http://www.diveintopython.net/
    • http://docs.python.org/2/tutorial/inputoutput.html
    • check if all elements in a list are identical
    • Retaining order while using Python's set difference
    0 讨论(0)
  • 2021-01-18 13:41

    Comparing two clusterings is not trivial task and reinventing the wheel is unlikely to be successful. Check out this package which has lots of different cluster similarity metrics and can compare dendrograms (the data structure you have).

    The library is called CluSim and can be found here: https://github.com/Hoosier-Clusters/clusim/

    0 讨论(0)
提交回复
热议问题