Grid search for hyperparameter evaluation of clustering in scikit-learn


Question


I'm clustering a sample of about 100 records (unlabelled) and trying to use grid search to evaluate the clustering algorithm with various hyperparameters. I'm scoring using silhouette_score, which works fine.

My problem here is that I don't need the cross-validation aspect of GridSearchCV/RandomizedSearchCV, but I can't find a simple GridSearch/RandomizedSearch. I can write my own, but the ParameterSampler and ParameterGrid objects are very useful.
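
For example, ParameterGrid can already be iterated over directly, without any of the cross-validation machinery. A minimal sketch, assuming distance_matrix is the precomputed square distance matrix used below (in older scikit-learn releases ParameterGrid lives in sklearn.grid_search rather than sklearn.model_selection):

from sklearn.model_selection import ParameterGrid
from sklearn.cluster import KMeans
from sklearn import metrics

best_score, best_params = -1.0, None
for params in ParameterGrid({"n_clusters": list(range(2, 11))}):
    labels = KMeans(**params).fit_predict(distance_matrix)
    score = metrics.silhouette_score(distance_matrix, labels, metric='precomputed')
    if score > best_score:
        best_score, best_params = score, params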

My next step will be to subclass BaseSearchCV and implement my own _fit() method, but I thought it was worth asking whether there is a simpler way to do this, for example by passing something to the cv parameter?

def silhouette_score(estimator, X):
    clusters = estimator.fit_predict(X)
    # distance_matrix is a precomputed square pairwise-distance matrix defined elsewhere
    score = metrics.silhouette_score(distance_matrix, clusters, metric='precomputed')
    return score

ca = KMeans()
param_grid = {"n_clusters": range(2, 11)}

# run grid search
search = GridSearchCV(
    ca,
    param_grid=param_grid,
    scoring=silhouette_score,
    cv= # can I pass something here to only use a single fold?
    )
search.fit(distance_matrix)

Answer 1:


Ok, this might be an old question, but I use this kind of code:

First, we want to generate all the possible combinations of parameters:

# Recursively yield every combination of the values in `parameters`
# (similar in spirit to scikit-learn's ParameterGrid).
def make_generator(parameters):
    if not parameters:
        yield dict()
    else:
        key_to_iterate = list(parameters.keys())[0]
        next_round_parameters = {p: parameters[p]
                                 for p in parameters if p != key_to_iterate}
        for val in parameters[key_to_iterate]:
            for pars in make_generator(next_round_parameters):
                temp_res = dict(pars)  # copy so yielded dicts stay independent
                temp_res[key_to_iterate] = val
                yield temp_res

Then create a loop out of this:

# add fixed parameters - here it's just an arbitrary one
fixed_params = {"max_iter": 300}

param_grid = {"n_clusters": range(2, 11)}

for params in make_generator(param_grid):
    params.update(fixed_params)
    ca = KMeans(**params)
    ca.fit(_data)
    labels = ca.labels_
    # Evaluate the clustering labels (e.g. with a silhouette score)
    # and decide whether to keep or discard this parameter combination.

Of course, this can be wrapped up in a tidy function, as sketched below; the code above is mostly an example.
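
For instance, a minimal sketch of such a wrapper, under the assumption that the clustering is judged by its silhouette score (as in the question) and that _data is the feature matrix from the loop above; the name best_kmeans is just for illustration:

from sklearn.cluster import KMeans
from sklearn import metrics

def best_kmeans(data, param_grid, fixed_params):
    # Try every combination from make_generator and keep the best silhouette score.
    best_params, best_score = None, -1.0
    for params in make_generator(param_grid):
        params.update(fixed_params)
        labels = KMeans(**params).fit_predict(data)
        score = metrics.silhouette_score(data, labels)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = best_kmeans(_data, {"n_clusters": range(2, 11)}, {"max_iter": 300})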

Hope it helps someone!




Answer 2:


Recently I ran into a similar problem. I defined a custom iterable cv_custom, which defines the splitting strategy and is the input for the cross-validation parameter cv. This iterable should contain one pair per fold, with samples identified by their indices, e.g. ([fold1_train_ids], [fold1_test_ids]), ([fold2_train_ids], [fold2_test_ids]), ... In our case, we need just one pair for a single fold, with the indices of all examples in both the train and the test part: ([train_ids], [test_ids])

N = len(distance_matrix)
# a single "fold" whose train and test parts both contain every sample
cv_custom = [(range(0, N), range(0, N))]
scores = cross_val_score(clf, X, y, cv=cv_custom)
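
Putting the pieces together for the question's setup, a minimal sketch: GridSearchCV accepts such an iterable of (train, test) index pairs for cv, so a single all-samples split effectively disables cross-validation. Here distance_matrix and the silhouette-based scorer are the ones from the question (note that KMeans then treats the rows of the distance matrix as feature vectors, exactly as in the question's own search.fit(distance_matrix)):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn import metrics

def silhouette_scorer(estimator, X):
    labels = estimator.fit_predict(X)
    return metrics.silhouette_score(X, labels, metric='precomputed')

N = len(distance_matrix)
single_split = [(np.arange(N), np.arange(N))]  # train == test == all samples

search = GridSearchCV(
    KMeans(),
    param_grid={"n_clusters": list(range(2, 11))},
    scoring=silhouette_scorer,
    cv=single_split,
)
search.fit(distance_matrix)
print(search.best_params_)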


Source: https://stackoverflow.com/questions/34611038/grid-search-for-hyperparameter-evaluation-of-clustering-in-scikit-learn
