sklearn - Cross validation with multiple scores

半阙折子戏 2020-12-23 17:46

I would like to compute the recall, precision and f-measure of a cross-validation test for different classifiers. scikit-learn's cross_val_score, however, only computes a single score at a time.

5 Answers
  • 2020-12-23 18:26

    You could use this:

    from sklearn.metrics import f1_score, precision_score, recall_score
    from sklearn.model_selection import cross_val_score
    from multiscorer import MultiScorer
    import numpy as np
    
    scorer = MultiScorer({
        'F-measure' : (f1_score, {...}),
        'Precision' : (precision_score, {...}),
        'Recall' : (recall_score, {...})
    })
    
    ...
    
    cross_val_score(clf, X, target, scoring=scorer)
    results = scorer.get_results()
    
    for name in results.keys():
        print('%s: %.4f' % (name, np.average(results[name])))
    

    The source of multiscorer is on GitHub.
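
    The {...} placeholders are the keyword arguments forwarded to each metric function. For instance (an illustrative choice, not part of the original answer), with macro averaging they could be filled in like this:

    from sklearn.metrics import f1_score, precision_score, recall_score
    from multiscorer import MultiScorer
    
    # Illustrative only: forward average='macro' to each metric function
    scorer = MultiScorer({
        'F-measure' : (f1_score, {'average': 'macro'}),
        'Precision' : (precision_score, {'average': 'macro'}),
        'Recall' : (recall_score, {'average': 'macro'})
    })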

  • 2020-12-23 18:31

    You can use the following code to compute accuracy, precision, recall and any other metrics while fitting your estimator only once per cross-validation fold.

    import numpy as np
    import pandas as pd
    import sklearn.metrics
    from sklearn.model_selection import KFold, StratifiedKFold
    
    
    def get_true_and_pred_CV(estimator, X, y, n_folds, cv, params):
        ys = []
        for train_idx, valid_idx in cv.split(X, y):
            clf = estimator(**params)
            if isinstance(X, np.ndarray):
                clf.fit(X[train_idx], y[train_idx])
                cur_pred = clf.predict(X[valid_idx])
            elif isinstance(X, pd.DataFrame):
                clf.fit(X.iloc[train_idx, :], y[train_idx])
                cur_pred = clf.predict(X.iloc[valid_idx, :])
            else:
                raise TypeError('Only numpy arrays and pandas DataFrames '
                                'are supported as types of X')
            # collect the true and predicted labels of every fold
            ys.append((y[valid_idx], cur_pred))
        return ys
    
    
    def fit_and_score_CV(estimator, X, y, n_folds=10, stratify=True, **params):
        if not stratify:
            cv_arg = KFold(n_splits=n_folds)
        else:
            cv_arg = StratifiedKFold(n_splits=n_folds)
    
        ys = get_true_and_pred_CV(estimator, X, y, n_folds, cv_arg, params)
        cv_acc = [sklearn.metrics.accuracy_score(t, p) for t, p in ys]
        cv_pr_weighted = [sklearn.metrics.precision_score(t, p, average='weighted') for t, p in ys]
        cv_rec_weighted = [sklearn.metrics.recall_score(t, p, average='weighted') for t, p in ys]
        cv_f1_weighted = [sklearn.metrics.f1_score(t, p, average='weighted') for t, p in ys]
    
        # the approach below fits the estimator once per metric instead of once per fold
        #cv_acc = cross_val_score(algo, X, y, cv=cv_arg, scoring='accuracy')
        #cv_pr_weighted = cross_val_score(algo, X, y, cv=cv_arg, scoring='precision_weighted')
        #cv_rec_weighted = cross_val_score(algo, X, y, cv=cv_arg, scoring='recall_weighted')
        #cv_f1_weighted = cross_val_score(algo, X, y, cv=cv_arg, scoring='f1_weighted')
        return {'CV accuracy': np.mean(cv_acc), 'CV precision_weighted': np.mean(cv_pr_weighted),
                'CV recall_weighted': np.mean(cv_rec_weighted), 'CV F1_weighted': np.mean(cv_f1_weighted)}
    

    I frequently use these functions instead of cross_val_score to compute several statistics at once. You can swap in whatever quality metrics you need.
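
    For example, a quick check on the iris dataset (a sketch; RandomForestClassifier is an arbitrary choice here, and estimator keyword arguments are forwarded via **params):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    
    iris = load_iris()
    # pass the estimator class itself: each fold gets a fresh instance
    print(fit_and_score_CV(RandomForestClassifier, iris.data, iris.target,
                           n_folds=5, n_estimators=100))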

  • 2020-12-23 18:37

    Now in scikit-learn: cross_validate is a new function that can evaluate a model on multiple metrics. This feature is also available in GridSearchCV and RandomizedSearchCV (doc). It was merged recently into master and will be available in v0.19.

    From the scikit-learn doc:

    The cross_validate function differs from cross_val_score in two ways: 1. It allows specifying multiple metrics for evaluation. 2. It returns a dict containing training scores, fit-times and score-times in addition to the test score.

    The typical use case looks like this:

    from sklearn.svm import SVC
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_validate
    iris = load_iris()
    scoring = ['precision', 'recall', 'f1']
    clf = SVC(kernel='linear', C=1, random_state=0)
    scores = cross_validate(clf, iris.data, iris.target == 1, cv=5,
                            scoring=scoring, return_train_score=False)
    

    See also this example.
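
    cross_validate returns a dict of per-fold arrays keyed by 'test_<metric>' (plus 'fit_time' and 'score_time'), so summarizing each metric is just a mean over folds:

    import numpy as np
    
    # keys here: 'fit_time', 'score_time', 'test_precision', 'test_recall', 'test_f1'
    for metric in scoring:
        print('%s: %.3f' % (metric, np.mean(scores['test_' + metric])))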

  • 2020-12-23 18:40

    The solution you present replicates exactly the functionality of cross_val_score, adapted to your situation. It seems like the right way to go.

    cross_val_score takes the argument n_jobs=, making the evaluation parallelizable. If this is something you need, you should look into replacing your for loop with a parallel loop, using sklearn.externals.joblib.Parallel, as sketched below.
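
    A minimal sketch of such a parallel fold loop (written against the current joblib and sklearn.model_selection APIs rather than the 0.14-era ones; fit_and_score_fold is a hypothetical helper, and clf, X, y are assumed to be defined):

    from joblib import Parallel, delayed
    from sklearn.base import clone
    from sklearn.metrics import precision_score, recall_score, f1_score
    from sklearn.model_selection import StratifiedKFold
    
    def fit_and_score_fold(clf, X, y, train_idx, valid_idx):
        # fit a fresh clone on the training part of one fold, score on the rest
        model = clone(clf).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[valid_idx])
        return (precision_score(y[valid_idx], pred, average='weighted'),
                recall_score(y[valid_idx], pred, average='weighted'),
                f1_score(y[valid_idx], pred, average='weighted'))
    
    cv = StratifiedKFold(n_splits=10)
    # n_jobs=-1 runs one fold per available core, like cross_val_score's n_jobs=
    fold_scores = Parallel(n_jobs=-1)(
        delayed(fit_and_score_fold)(clf, X, y, tr, va) for tr, va in cv.split(X, y))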

    On a more general note, a discussion is going on about the problem of multiple scores in the issue tracker of scikit-learn. A representative thread can be found here. So while it looks like future versions of scikit-learn will permit multiple outputs from scorers, as of now this is impossible.

    A hacky (disclaimer!) way to get around this is to change the code in cross_validation.py ever so slightly, by removing a condition check on whether your score is a number. However, this suggestion is very version-dependent, so I will present it for version 0.14.

    1) In IPython, type from sklearn import cross_validation, followed by cross_validation??. Note the filename that is displayed and open it in an editor (you may need root privileges).

    2) You will find this code, where I have already tagged the relevant line (1066). It says

        if not isinstance(score, numbers.Number):
            raise ValueError("scoring must return a number, got %s (%s)"
                             " instead." % (str(score), type(score)))
    

    These lines need to be removed. To keep track of what was there (in case you ever want to change it back), replace them with the following:

        if not isinstance(score, numbers.Number):
            pass
            # raise ValueError("scoring must return a number, got %s (%s)"
            #                 " instead." % (str(score), type(score)))
    

    If what your scorer returns doesn't make cross_val_score choke elsewhere, this should resolve your issue. Please let me know if this is the case.

  • 2020-12-23 18:40

    This might be helpful if you are looking for multiple metrics in a multi-class setting. With scikit-learn 0.19 and above, you can pass your own dictionary of metric functions:

    from sklearn.metrics import (accuracy_score, balanced_accuracy_score, f1_score,
                                 make_scorer, precision_score, recall_score)
    from sklearn.model_selection import cross_validate
    
    custom_scorer = {'accuracy': make_scorer(accuracy_score),
                     'balanced_accuracy': make_scorer(balanced_accuracy_score),
                     'precision': make_scorer(precision_score, average='macro'),
                     'recall': make_scorer(recall_score, average='macro'),
                     'f1': make_scorer(f1_score, average='macro'),
                     }
    # cross_validate (rather than cross_val_score) accepts a dict of scorers
    scores = cross_validate(clf, X_train, y_train,
                            cv=10, scoring=custom_scorer)
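
    As in the cross_validate answer above, each entry of scores is an array with one value per fold, e.g. scores['test_balanced_accuracy'].mean() gives the averaged balanced accuracy.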
    
    