The documentation for sklearn.cluster.AgglomerativeClustering mentions that, when varying the number of clusters and using caching, it may be advantageous to compute the full tree. You set a caching directory with the parameter memory='mycachedir', and if you also set compute_full_tree=True, then when you rerun fit with different values of n_clusters, it will reuse the cached tree rather than recomputing it each time. To give you an example of how to do this with sklearn's grid search API:
from sklearn.cluster import AgglomerativeClustering
from sklearn.model_selection import GridSearchCV

ac = AgglomerativeClustering(memory='mycachedir',
                             compute_full_tree=True)
classifier = GridSearchCV(ac,
                          {'n_clusters': range(2, 6)},
                          scoring='adjusted_rand_score',
                          n_jobs=-1, verbose=2)
classifier.fit(X, y)  # X: data, y: reference labels for the scorer
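If you don't need the grid search machinery, here is a minimal sketch of the caching behaviour on its own: the same estimator is refit with different n_clusters, so only the first fit builds the full tree and later fits reuse it from the cache directory. The random data, the labels y, and the directory name 'mycachedir' are just placeholders for your own inputs.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

# Placeholder data; swap in your own X and reference labels y.
X = np.random.RandomState(0).rand(100, 5)
y = np.random.RandomState(1).randint(0, 3, size=100)

# The full tree is computed and cached on the first fit;
# subsequent fits with a different n_clusters reuse it.
ac = AgglomerativeClustering(memory='mycachedir', compute_full_tree=True)

for k in range(2, 6):
    ac.set_params(n_clusters=k)
    labels = ac.fit_predict(X)
    print(k, adjusted_rand_score(y, labels))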