Hierarchical clustering of 1 million objects

轮回少年 2021-01-30 14:29

Can anyone point me to a hierarchical clustering tool (preferably in Python) that can cluster ~1 million objects? I have tried hcluster and also Orange.

hcluster

2 answers
  •  [愿得一人]
    2021-01-30 15:27

    To beat O(n^2), you'll have to first reduce your 1M points (documents) to e.g. 1000 piles of 1000 points each, or 100 piles of 10k each, or ...
    Two possible approaches:

    • build a hierarchical tree from say 15k points, then add the rest one by one: time ~ 1M * treedepth

    • first build 100 or 1000 flat clusters, then build your hierarchical tree of the 100 or 1000 cluster centres.
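    A minimal sketch of the second approach, assuming the objects are already dense numeric vectors (which may not hold for text). MiniBatchKMeans from scikit-learn and linkage / fcluster from scipy.cluster.hierarchy are my choice of tools here; the helper names two_stage_hierarchy and cut_tree, the synthetic data, and all parameter values are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import MiniBatchKMeans


def two_stage_hierarchy(points, n_flat=100, method="average", seed=0):
    """Flat-cluster the points, then build a tree over the cluster centres."""
    km = MiniBatchKMeans(n_clusters=n_flat, random_state=seed, n_init=3)
    flat_labels = km.fit_predict(points)  # one flat cluster id per point
    # The hierarchical step sees only n_flat centres, so its pairwise-distance
    # cost is O(n_flat^2) instead of O(n^2) over all points.
    tree = linkage(km.cluster_centers_, method=method)
    return flat_labels, tree


def cut_tree(flat_labels, tree, n_groups):
    """Cut the centre tree into at most n_groups and label every point."""
    centre_groups = fcluster(tree, t=n_groups, criterion="maxclust")
    return centre_groups[flat_labels]  # look up each point's centre group


# Synthetic stand-in data (an assumption; substitute your own feature matrix).
rng = np.random.default_rng(0)
points = rng.normal(size=(20_000, 8))

flat_labels, tree = two_stage_hierarchy(points, n_flat=50)
groups = cut_tree(flat_labels, tree, n_groups=5)
```

    The tree here only describes how the flat clusters relate to each other; the structure inside each flat cluster is lost, so the choice of n_flat sets the finest level of detail you can recover.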

    How well either of these works depends critically on the size and shape of your target tree: how many levels, how many leaves?
    What software are you using, and how many hours or days do you have to do the clustering?

    For the flat-cluster approach, k-d trees work fine for points in 2d, 3d, 20d, even 128d -- not your case. I know hardly anything about clustering text; perhaps locality-sensitive hashing?

    Take a look at scikit-learn clustering -- it has several methods, including DBSCAN.

    Added: see also
    google-all-pairs-similarity-search, "Algorithms for finding all similar pairs of vectors in sparse vector data", Bayardo et al. 2007
    SO hierarchical-clusterization-heuristics
