Question
I have a distance matrix with about 5000 entries, and use scipy's hierarchical clustering methods to cluster the matrix. The code I use for this is the following snippet:
import fastcluster
import scipy.cluster.hierarchy as sch
Y = fastcluster.linkage(D, method='centroid')  # D is the distance matrix
Z1 = sch.dendrogram(Y, truncate_mode='level', p=7, show_contracted=True)
Since the dendrogram will become rather dense with all this data, I use the truncate_mode to prune it a bit. All of this works, but I wonder how I can find out which of the original 5000 entries belong to a particular branch in the dendrogram.
I tried using
leaves = sch.leaves_list(Y)
to get a list of leaves, but this takes the linkage output as input, and while I can see the correspondence between the pruned dendrogram and the leaves list, mapping the original entries to the dendrogram by hand is cumbersome.
To summarize: is there a way of listing all the original entries in the distance matrix that belong to a branch in a pruned dendrogram? Or are there other methods of doing this that I am not aware of?
Thanks
Answer 1:
The dictionary returned by scipy.cluster.hierarchy.dendrogram has the key ivl, which the documentation describes as:
a list of labels corresponding to the leaf nodes
You can supply custom labels (using labels=<array of labels>) as input to the dendrogram function, but by default they are just the indices of the original observations. By comparing the original labels/indices with Z1['ivl'], you can determine what the original entries were.
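In a truncated dendrogram, a contracted branch appears in ivl as a count such as "(12)" rather than as the indices it contains, so a further lookup is needed to list its members. The following is a minimal sketch of one way to do that, not necessarily the only one: it pairs the 'leaves' key of the returned dictionary (the cluster node ids shown at the bottom of the plot) with sch.to_tree and ClusterNode.pre_order to recover the original observation indices under each branch. The random data, the variable names, and the use of scipy's own linkage with method='average' in place of fastcluster are assumptions made to keep the example self-contained; the same steps should apply to any valid linkage matrix.

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from scipy.spatial.distance import pdist

# Illustrative data (an assumption): 50 random 2-D points stand in for the real entries.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
D = pdist(X)                           # condensed distance matrix
Y = sch.linkage(D, method='average')   # scipy's linkage used here instead of fastcluster

# no_plot=True computes the dendrogram layout without drawing it.
R = sch.dendrogram(Y, truncate_mode='level', p=3, no_plot=True)

# rd=True also returns a list in which nodes[i] is the ClusterNode with id i.
root, nodes = sch.to_tree(Y, rd=True)

# R['leaves'] holds the node id at each leaf position of the truncated plot.
# Ids below the number of observations are original entries; larger ids are
# contracted clusters. pre_order() lists the original indices under a node.
members = {node_id: nodes[node_id].pre_order() for node_id in R['leaves']}

for node_id, label in zip(R['leaves'], R['ivl']):
    print(label, members[node_id])
```

Each printed line pairs the label shown on the truncated dendrogram with the original indices that fall under that branch. If custom labels were passed to dendrogram, the indices can be used to look them up in the same array.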
Source: https://stackoverflow.com/questions/10305111/pruning-dendrogram-in-scipy-hierarchical-clustering