Scikit-learn: How do we define a distance metric's parameter for grid search

Question

I have the following code snippet, which attempts a grid search in which one of the grid parameters is the distance metric used by the KNN algorithm. The example below fails if I use the "wminkowski", "seuclidean" or "mahalanobis" distance metrics.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Define the parameter values that should be searched
k_range    = range(1, 31)
weights    = ['uniform', 'distance']
algos      = ['auto', 'ball_tree', 'kd_tree', 'brute']
leaf_sizes = range(10, 60, 10)
metrics    = ["euclidean", "manhattan", "chebyshev", "minkowski", "mahalanobis"]

param_grid = dict(n_neighbors=list(k_range), weights=weights, algorithm=algos,
                  leaf_size=list(leaf_sizes), metric=metrics)

# Instantiate the algorithm
knn = KNeighborsClassifier(n_neighbors=10)

# Instantiate the grid
grid = GridSearchCV(knn, param_grid=param_grid, cv=10, scoring='accuracy', n_jobs=-1)

# Fit the models using the grid parameters
grid.fit(X,y)

I assume this is because I have to set or define the ranges for the various distance parameters (for example p and w for "wminkowski", i.e. WMinkowskiDistance). The "minkowski" distance may be working because its "p" parameter defaults to 2.

So my questions are:

Can we set a range of values for a distance metric's parameters in the grid search, and if so, how?
Can we set a single value for a distance metric's parameter in the grid search, and if so, how?

Hope the question is clear. TIA

Answer 1:

I finally got the answer with help from the scikit-learn user and developer mailing list. I am posting what I learned here in the hope that it will help others too.

The answer to both questions above is yes. This is the example code I got from the mailing list (it tunes SVM kernel parameters rather than KNN metrics, but the structure is the same):

params = [{'kernel':['poly'],'degree':[1,2,3],'gamma':[1/p,1,2],'coef0':[-1,0,1]},
          {'kernel':['rbf'],'gamma':[1/p,1,2]},
          {'kernel':['sigmoid'],'gamma':[1/p,1,2],'coef0':[-1,0,1]}]

Two things to note:

You can pass a list of parameter sets (dictionaries), and in each set you are free to include only what that group of parameters requires. This means we can select the metric together with its corresponding parameters. The parameters are identified by the dictionary keys.
For each key we can supply a list of values; every combination of these values will be used by the grid search and passed on to the corresponding metric function.
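To make the two points above concrete, here is a minimal sketch that uses scikit-learn's ParameterGrid to show how a list of dictionaries is expanded. The metric names and value ranges are chosen only for illustration and are not taken from the mailing-list example.

```python
from sklearn.model_selection import ParameterGrid

# Two independent sub-grids: each dict lists only the keys its metric needs.
param_grid = [
    {"metric": ["euclidean", "manhattan"], "n_neighbors": [3, 5]},
    {"metric": ["minkowski"], "p": [1, 2, 3], "n_neighbors": [3, 5]},
]

# ParameterGrid expands each dict separately and concatenates the results,
# which is the same expansion GridSearchCV performs internally.
candidates = list(ParameterGrid(param_grid))

print(len(candidates))
for params in candidates:
    print(params)
```

The first dictionary contributes 2 x 2 combinations and the second 1 x 3 x 2, so "p" only ever appears together with "minkowski".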

This still leaves us with one issue: how do we pass the combination of parameters to the metric? Note: not every metric can be used with every algorithm, so you have to set the compatible algorithms manually.

Here is the example I asked for above:

{'metric': ['wminkowski'],
 'metric_params': [
     {'w': np.array([2.0] * len(X.columns)), 'p': 1.0},   # L1
     {'w': np.array([2.0] * len(X.columns)), 'p': 1.5},
     {'w': np.array([2.0] * len(X.columns)), 'p': 2.0},   # L2
     {'w': np.array([2.0] * len(X.columns)), 'p': 2.5},
     {'w': np.array([2.0] * len(X.columns)), 'p': 3.0},
     {'w': np.array([2.0] * len(X.columns)), 'p': 3.5},
 ],
 'algorithm': ['brute', 'ball_tree'],
 'n_neighbors': list(k_range), 'weights': weights, 'leaf_size': list(leaf_sizes)}

Note the following:

'wminkowski' only works with the 'brute' and 'ball_tree' algorithms.
We must use a list of dictionaries in 'metric_params' to enumerate all the possible parameter combinations (I have not found a way to automate this).
In the case above I had to use a NumPy array for 'w' because a plain list is not converted implicitly (otherwise we get an exception).
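As a rough end-to-end sketch of the approach above, the following runs a small grid on the iris dataset. It assumes scikit-learn 1.1 or later, where, as far as I know, 'wminkowski' is deprecated (and later removed) and the plain 'minkowski' metric accepts a weight vector w through metric_params instead. The dataset, the shortened value ranges and cv=5 are illustrative choices, not part of the original answer. The two metrics apply the weights differently, so scores may not match those obtained with 'wminkowski'.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

# Assumption: scikit-learn >= 1.1, where "minkowski" accepts weights "w".
w = np.full(n_features, 2.0)

param_grid = [
    # Metrics that need no extra parameters.
    {
        "metric": ["euclidean", "manhattan", "chebyshev"],
        "n_neighbors": [3, 5, 7],
        "weights": ["uniform", "distance"],
    },
    # Weighted Minkowski: p is a regular estimator parameter,
    # while the weight vector is passed through metric_params.
    {
        "metric": ["minkowski"],
        "p": [1.0, 1.5, 2.0],
        "metric_params": [{"w": w}],
        "algorithm": ["brute", "ball_tree"],
        "n_neighbors": [3, 5, 7],
        "weights": ["uniform", "distance"],
    },
]

grid = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)
```

Because "p" is already a constructor argument of KNeighborsClassifier, it can be searched as an ordinary grid key here, which leaves only the weight vector inside metric_params.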

If anyone knows of a better way of doing this, please comment.
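One possible way to automate the enumeration of 'metric_params' mentioned in the notes is itertools.product. The sketch below is only a suggestion: expand_metric_params is a hypothetical helper written for this example, not a scikit-learn function, and n_features = 4 stands in for len(X.columns).

```python
from itertools import product

import numpy as np


def expand_metric_params(**choices):
    """Return one dict per combination of the given value lists.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    keys = list(choices)
    value_lists = [choices[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


n_features = 4  # stand-in for len(X.columns)

metric_params = expand_metric_params(
    w=[np.full(n_features, 2.0)],       # candidate weight vectors
    p=[1.0, 1.5, 2.0, 2.5, 3.0, 3.5],   # candidate Minkowski orders
)

# The generated list can be dropped into the grid entry shown earlier.
grid_entry = {
    "metric": ["wminkowski"],
    "metric_params": metric_params,
    "algorithm": ["brute", "ball_tree"],
}

print(len(metric_params))
print(metric_params[0])
```

Adding a second weight vector to the w list would double the number of generated dictionaries without any further changes.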

Source: https://stackoverflow.com/questions/37924606/scikit-learn-how-do-we-define-a-distance-metrics-parameter-for-grid-search

Tags

parameters

scikit-learn

distance

metrics

grid-search