Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange.
The problem probably is that they will try to compute the full 2D distance matrix (for n = 1 million, about 8 TB naively with double precision: 10^12 entries at 8 bytes each) and then their algorithm will run in O(n^3) time anyway.
You should seriously consider using a different clustering algorithm. Hierarchical clustering is slow and the results are usually not very convincing, in particular for millions of objects, where you can't just look at the dendrogram to choose the appropriate cut.
If you really want to continue hierarchical clustering, I believe that ELKI (Java, though) has an O(n^2) implementation of SLINK, which at 1 million objects should be approximately 1 million times as fast as the naive O(n^3) approach. I don't know if they already have CLINK, too. And I'm not sure if there actually is any sub-O(n^3) algorithm for variants other than single-link and complete-link.
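If you want to try single-link in Python first, SciPy's `linkage` exposes the same criterion. A minimal sketch on toy data; note that SciPy still materialises the condensed pairwise distances internally, so this shows what single-link does, not that it scales to 1M points:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each.
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])

# 'single' = single-link: clusters merge by the minimum
# pairwise distance between their members.
Z = linkage(X, method='single')

# Cut the dendrogram into (at most) 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(len(set(labels)))
```

The `fastcluster` package reportedly offers O(n^2) single-link with the same call signature, but memory for the distances remains the practical limit at 1M objects.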
Consider using other algorithms. k-means, for example, scales very well with the number of objects (it's just usually not very good either, unless your data is very clean and regular). DBSCAN and OPTICS are quite good in my opinion, once you have a feel for the parameters. If your data set is low-dimensional, they can be accelerated quite well with an appropriate index structure; they should then run in O(n log n) if you have an index with O(log n) query time, which can make a huge difference for large data sets. I've personally used OPTICS on a 110k-image data set without problems, so I can imagine it scales up well to 1 million on your system.
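As a hedged sketch of the density-based route in Python: scikit-learn ships an OPTICS implementation whose neighbour queries use tree indexes for low-dimensional data. The parameter values below are illustrative for toy blobs, not recommendations:

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)
# Two dense blobs plus a little uniform background noise.
X = np.vstack([rng.normal(0, 0.2, (100, 2)),
               rng.normal(4, 0.2, (100, 2)),
               rng.uniform(-2, 6, (10, 2))])

# min_samples is the main density knob; max_eps bounds the
# neighbourhood search so the spatial index can prune.
# cluster_method='dbscan' extracts DBSCAN-equivalent clusters
# at the given eps from the OPTICS ordering.
opt = OPTICS(min_samples=10, max_eps=1.0,
             cluster_method='dbscan', eps=0.5).fit(X)
print(set(opt.labels_) - {-1})  # cluster ids; noise is labelled -1
```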
To beat O(n^2), you'll have to first reduce your 1M points (documents) to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
Two possible approaches:

- build a hierarchical tree from, say, 15k points, then add the rest one by one: time ~ 1M * tree depth
- first build 100 or 1000 flat clusters, then build your hierarchical tree of the 100 or 1000 cluster centres.
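The second approach can be sketched in a few lines. This is a rough illustration only, assuming scikit-learn's `MiniBatchKMeans` for the flat stage and SciPy for the tree; the point and cluster counts are placeholders, not tuned values:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
X = rng.normal(size=(50_000, 8))  # stand-in for the real 1M points

# Stage 1: flat clustering into ~1000 piles; mini-batch k-means
# scales roughly linearly in the number of points.
km = MiniBatchKMeans(n_clusters=1000, batch_size=4096,
                     n_init=3, random_state=0).fit(X)

# Stage 2: hierarchical tree over just the 1000 centres, so the
# expensive linkage step runs on k=1000, not n=1M.
Z = linkage(km.cluster_centers_, method='average')
print(Z.shape)  # (999, 4): k-1 merge steps
```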
How well either of these might work depends critically on the size and shape of your target tree -- how many levels, how many leaves? What software are you using, and how many hours / days do you have to do the clustering?
For the flat-cluster approach, k-d trees work fine for points in 2d, 3d, 20d, even 128d -- not your case. I know hardly anything about clustering text; locality-sensitive hashing?
Take a look at scikit-learn clustering -- it has several methods, including DBSCAN.
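A minimal sketch of that scikit-learn API for DBSCAN (the `eps` / `min_samples` values are made up for toy blobs, not recommendations for real data):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.2, (200, 2)),
               rng.normal(3, 0.2, (200, 2))])

# eps and min_samples are the two DBSCAN knobs; for low-dimensional
# data the neighbour queries are served by a tree index
# (algorithm='auto' picks k-d tree / ball tree when applicable).
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(len(set(db.labels_) - {-1}))  # noise points carry the label -1
```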
Added: see also

- google-all-pairs-similarity-search: "Algorithms for finding all similar pairs of vectors in sparse vector data", Bayardo et al. 2007
- SO: hierarchical-clusterization-heuristics