Question
I'm clustering a sample of about 100 records (unlabelled) and trying to use grid_search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score, which works fine.
My problem here is that I don't need the cross-validation aspect of GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I can write my own, but the ParameterSampler and ParameterGrid objects are very useful.
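For reference, a minimal sketch of how these two objects can be iterated on their own, outside of any SearchCV class (the parameter values here are only an illustration):

from sklearn.model_selection import ParameterGrid, ParameterSampler

# ParameterGrid enumerates every combination exhaustively;
# ParameterSampler draws n_iter combinations at random.
for params in ParameterGrid({"n_clusters": range(2, 11)}):
    print(params)

for params in ParameterSampler({"n_clusters": range(2, 11)}, n_iter=3, random_state=0):
    print(params)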
My next step will be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking: is there a simpler way to do this, for example by passing something to the cv parameter?
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

# distance_matrix is a precomputed pairwise distance matrix for the records (defined elsewhere)
def silhouette_score(estimator, X):
    clusters = estimator.fit_predict(X)
    score = metrics.silhouette_score(distance_matrix, clusters, metric='precomputed')
    return score

ca = KMeans()
param_grid = {"n_clusters": range(2, 11)}

# run grid search
search = GridSearchCV(
    ca,
    param_grid=param_grid,
    scoring=silhouette_score,
    cv=  # can I pass something here to only use a single fold?
)
search.fit(distance_matrix)
Answer 1:
OK, this might be an old question, but I use this kind of code:
First, we want to generate all the possible combinations of parameters:
def make_generator(parameters):
    # Recursively yield every combination of values from the parameter grid.
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p: parameters[p]
                                 for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = pars
                temp_res[key_to_iterate] = val
                yield temp_res
Then create a loop out of this:
from sklearn.cluster import KMeans

# add fixed parameters - here it's just an arbitrary one
fixed_params = {"max_iter": 300}
param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Evaluate the clustering labels here and
    # decide whether to keep or discard this parameter set!
Of course, this can all be wrapped up in a tidy function, so this solution is mostly an example.
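For example, a possible way to wrap the loop into a small helper that keeps the parameters with the best silhouette score (a sketch; the silhouette-based scoring and the grid_search_clustering name are my own assumptions, not part of the original answer):

from sklearn import metrics
from sklearn.cluster import KMeans

def grid_search_clustering(data, param_grid, fixed_params):
    # Try every combination and keep the one with the highest silhouette score.
    best_score, best_params, best_labels = -1.0, None, None
    for params in make_generator(param_grid):
        params.update(fixed_params)
        labels = KMeans(**params).fit_predict(data)
        score = metrics.silhouette_score(data, labels)
        if score > best_score:
            best_score, best_params, best_labels = score, params, labels
    return best_params, best_score, best_labels

best_params, best_score, _ = grid_search_clustering(
    _data, {"n_clusters": range(2, 11)}, {"max_iter": 300})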
Hope it helps someone!
Answer 2:
Recently I ran into a similar problem. I defined a custom iterable cv_custom, which defines the splitting strategy and is passed to the cross-validation parameter cv. This iterable should contain one pair per fold, with samples identified by their indices, e.g. ([fold1_train_ids], [fold1_test_ids]), ([fold2_train_ids], [fold2_test_ids]), ...
In our case, we need just one pair for a single fold, with the indices of all examples in both the train and the test part: ([train_ids], [test_ids])
from sklearn.model_selection import cross_val_score

N = len(distance_matrix)
cv_custom = [(range(0, N), range(0, N))]
scores = cross_val_score(clf, X, y, cv=cv_custom)
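Applied to the original question, the same single-fold iterable can be passed to the cv parameter of GridSearchCV (a sketch of how the pieces might fit together; the scorer mirrors the one from the question, and names like silhouette_scorer and single_fold are my own assumptions):

import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

def silhouette_scorer(estimator, X):
    # With a single fold covering all samples, X is the full precomputed distance matrix.
    clusters = estimator.fit_predict(X)
    return metrics.silhouette_score(X, clusters, metric='precomputed')

N = len(distance_matrix)
single_fold = [(np.arange(N), np.arange(N))]  # "train" and "test" on all samples

search = GridSearchCV(
    KMeans(),
    param_grid={"n_clusters": range(2, 11)},
    scoring=silhouette_scorer,
    cv=single_fold,
)
search.fit(distance_matrix)
print(search.best_params_, search.best_score_)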
Source: https://stackoverflow.com/questions/34611038/grid-search-for-hyperparameter-evaluation-of-clustering-in-scikit-learn