Question
The documentation for sklearn.cluster.AgglomerativeClustering mentions that, when varying the number of clusters and using caching, it may be advantageous to compute the full tree.
This seems to imply that it is possible to first compute the full tree, and then quickly update the number of desired clusters as necessary, without recomputing the tree (with caching).
However, this procedure for changing the number of clusters does not seem to be documented. I would like to do this but am unsure how to proceed.
Update: To clarify, the fit method does not take the number of clusters as an input: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering.fit
Answer 1:
You set a caching directory with the parameter memory='mycachedir', and then, if you set compute_full_tree=True, when you rerun fit with different values of n_clusters, it will use the cached tree rather than recomputing it each time. To give you an example of how to do this with sklearn's grid search API:
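from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
# sklearn.grid_search was removed; GridSearchCV now lives in model_selection
from sklearn.model_selection import GridSearchCV

# X (feature matrix) and y (reference labels) are assumed to be defined already
ac = AgglomerativeClustering(memory='mycachedir',
                             compute_full_tree=True)

# AgglomerativeClustering has no predict(), so score via fit_predict instead
def ari_scorer(estimator, X, y):
    return adjusted_rand_score(y, estimator.fit_predict(X))

classifier = GridSearchCV(ac,
                          {'n_clusters': range(2, 6)},
                          scoring=ari_scorer,
                          n_jobs=-1, verbose=2)
classifier.fit(X, y)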
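If no grid search is needed, the same idea can be sketched with a plain loop: keep one estimator with a cache directory, change only n_clusters, and refit. This is a minimal sketch, not a definitive recipe. The toy data from make_blobs, the temporary cache directory, and the name labels_by_k are assumptions made for illustration and do not come from the question.

```python
import tempfile

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data standing in for the real feature matrix (assumption for this sketch).
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Temporary cache directory; any writable path can be passed as `memory`.
cache_dir = tempfile.mkdtemp()

ac = AgglomerativeClustering(memory=cache_dir, compute_full_tree=True)

labels_by_k = {}
for k in range(2, 6):
    # Only n_clusters changes between fits; the tree-building step is
    # memoized in cache_dir, so later fits should reuse the stored tree.
    ac.set_params(n_clusters=k)
    labels_by_k[k] = ac.fit_predict(X)

for k, labels in labels_by_k.items():
    # Print the cluster sizes obtained for each requested number of clusters.
    print(k, np.bincount(labels))
```

The first call to fit builds and stores the full tree; subsequent calls with a different n_clusters on the same X are expected to load it from the cache and only redo the cut. Reusing the same cache directory across sessions should give the same benefit, as long as X is unchanged.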
Source: https://stackoverflow.com/questions/36490241/sklearn-agglomerative-clustering-dynamically-updating-the-number-of-clusters