Calculate ordering of dendrogram leaves

问题

I have five points and I need to create dendrogram from these. The function 'dendrogram' can be used to find the ordering of these points as shown below. However, I do not want to use dendrogram as it is slow and result in error for large number of points (I asked this question here Python alternate way to find dendrogram). Can someone points me how to convert the 'linkage' output (Z) to the "dendrogram(Z)['ivl']" value.

>>> from hcluster import pdist, linkage, dendrogram
>>> import numpy
>>> from numpy.random import rand
>>> x = rand(5,3)
>>> Y = pdist(x)
>>> Z = linkage(Y)
>>> Z
array([[ 1.        ,  3.        ,  0.11443378,  2.        ],
       [ 0.        ,  4.        ,  0.47941843,  2.        ],
       [ 5.        ,  6.        ,  0.67596472,  4.        ],
       [ 2.        ,  7.        ,  0.79993986,  5.        ]])
>>> 



>>> dendrogram(Z)['ivl']
['2', '1', '3', '0', '4']
>>>

回答1:

Why is it slow? Sure, the naive way of computing linkage clustering is O(n^3) but for n=5 this is as cheap as it gets...

For the format of the scipy linkage matrix, see this question: scipy linkage format

Note that you may still need to sort the data optimally. The linkage matrix encoding above gives

Element 1 and Cluster 3 join at height 0.1144 (into a 2 element cluster, #5)
Element 0 and Cluster 4 join at height 0.7999 (into a 2 element cluster, #6)
Cluster 5 and Cluster 6 join at height 0.6759 (into a 4 element cluster, #7)
Element 2 and Cluster 7 join at height 0.7999 (into a 5 element cluster, #8)

but it might be ordered by linking distance, and not in a 1d ordering for visualization (because not everbody using linkage clustering will want to run dendrogram viusalization afterwards). But in any way, computing the dendrogram should be on the order of O(n log n) if you do need to sort, fairly cheap compared to the actual clustering.

Something along these lines should do the trick:

n = len(Z) + 1
cache = dict()
for k in range(len(Z)):
  c1, c2 = int(Z[k][0]), int(Z[k][1])
  c1 = [c1] if c1 < n else cache.pop(c1)
  c2 = [c2] if c2 < n else cache.pop(c2)
  cache[n+k] = c1 + c2
print cache[2*len(Z)]

This may appear to be linear, but the expected size of the arrays is log n, so depending on your list types it may still be O(n log n), while with linked lists it should indeed be doable in O(n).

But in the end, you might want to avoid hierarchical clustering. It is a popular introductory example to cluster analysis, because it is really easy to understand conceptually. There are some quite tricky algorithms (SLINK) to get it down to O(n^2) complexity. But there are more modern and powerful clustering algorithms that have lower complexity. Actually, OPTICS (Wikipedia) computes something quite similar (when you set minPts=2), and when you have a good index structure it will run in O(n log n). Plus you can increase minPts to get more meaningful clusters. (But do not use OPTICS in Weka, or that python version that is floating around - afaict they are both incomplete or buggy!)

回答2:

There is a dedicated function for calculating linearized leaf orders in scipy. Here it is. scipy.cluster.hierarchy.leaves_list.

来源：https://stackoverflow.com/questions/12572436/calculate-ordering-of-dendrogram-leaves

标签

python

data-mining

cluster-analysis

dendrogram