Recursive feature elimination on Random Forest using scikit-learn

一向 2020-12-28 18:06

I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created...
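
For readers unfamiliar with the scoring idea in the question, here is a minimal sketch of an out-of-bag (OOB) ROC AUC for a single feature subset. It assumes a binary target (iris class 2 versus the rest, chosen only for illustration) and uses `oob_score=True` so that the fitted forest exposes `oob_decision_function_`. The helper name `oob_roc_auc` is hypothetical and not part of scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Illustrative binary target (assumption): iris class 2 versus the rest.
X, y3 = load_iris(return_X_y=True)
y = (y3 == 2).astype(int)

def oob_roc_auc(X, y, n_estimators=200, random_state=0):
    """Fit a forest with OOB estimates enabled and score it by OOB ROC AUC."""
    rf = RandomForestClassifier(n_estimators=n_estimators, oob_score=True,
                                bootstrap=True, random_state=random_state, n_jobs=-1)
    rf.fit(X, y)
    # oob_decision_function_ holds class probabilities computed only from trees
    # that did not see each sample; column 1 is the positive class.
    return roc_auc_score(y, rf.oob_decision_function_[:, 1])

auc_all = oob_roc_auc(X, y)          # all four features
auc_two = oob_roc_auc(X[:, 2:], y)   # same score on a two-feature subset
print(auc_all, auc_two)
```

Because the OOB estimate reuses the training data, no separate hold-out split is needed to compare subsets, which is presumably why the question asks for it.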

  • 2020-12-28 18:21

    Here's what I ginned up. It's a pretty simple solution, and it relies on a custom accuracy metric (called weightedAccuracy) since I'm classifying a highly unbalanced dataset. It should be easy to make more extensible if desired.

    import pandas
    from sklearn import datasets
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    # sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
    from sklearn.model_selection import train_test_split

    def get_enhanced_confusion_matrix(actuals, predictions, labels):
        """Enhance confusion_matrix by adding sensitivity and specificity metrics."""
        cm = confusion_matrix(actuals, predictions, labels=labels)
        sensitivity = float(cm[1][1]) / float(cm[1][0] + cm[1][1])
        specificity = float(cm[0][0]) / float(cm[0][0] + cm[0][1])
        weightedAccuracy = (sensitivity * 0.9) + (specificity * 0.1)
        return cm, sensitivity, specificity, weightedAccuracy

    iris = datasets.load_iris()
    x = pandas.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
    y = pandas.Series(iris.target, name='target')
    response, _ = pandas.factorize(y)
    xTrain, xTest, yTrain, yTest = train_test_split(x, response, test_size=.25, random_state=36583)
    print("building the first forest")
    rf = RandomForestClassifier(n_estimators=500, min_samples_split=2, n_jobs=-1, verbose=1)
    rf.fit(xTrain, yTrain)
    # rank the features by importance, most important first
    importances = pandas.DataFrame({'name': x.columns, 'imp': rf.feature_importances_
                                    }).sort_values('imp', ascending=False).reset_index(drop=True)
    # labels=[0, 1] restricts the confusion matrix to the first two classes
    cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0, 1])
    numFeatures = len(x.columns)
    rfeMatrix = pandas.DataFrame({'numFeatures': [numFeatures],
                                  'weightedAccuracy': [weightedAccuracy],
                                  'sensitivity': [sensitivity],
                                  'specificity': [specificity]})
    print("running RFE on %d features" % numFeatures)
    for i in range(1, numFeatures, 1):
        # keep only the i most important features
        varsUsed = importances['name'][0:i].tolist()
        print("now using %d of %s features" % (len(varsUsed), numFeatures))
        xTrain, xTest, yTrain, yTest = train_test_split(x[varsUsed], response, test_size=.25)
        rf = RandomForestClassifier(n_estimators=500, min_samples_split=2,
                                    n_jobs=-1, verbose=1)
        rf.fit(xTrain, yTrain)
        cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0, 1])
        print('the sensitivity is %d percent' % (sensitivity * 100))
        print('the specificity is %d percent' % (specificity * 100))
        print('the weighted accuracy is %d percent' % (weightedAccuracy * 100))
        # DataFrame.append was removed in pandas 2.0; concat does the same job
        rfeMatrix = pandas.concat([rfeMatrix,
                                   pandas.DataFrame({'numFeatures': [len(varsUsed)],
                                                     'weightedAccuracy': [weightedAccuracy],
                                                     'sensitivity': [sensitivity],
                                                     'specificity': [specificity]})], ignore_index=True)
    # pick the smallest feature set that reaches the best weighted accuracy
    maxAccuracy = rfeMatrix.weightedAccuracy.max()
    maxAccuracyFeatures = min(rfeMatrix.numFeatures[rfeMatrix.weightedAccuracy == maxAccuracy])
    featuresUsed = importances['name'][0:maxAccuracyFeatures].tolist()
    print("the final features used are %s" % featuresUsed)
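
One way the weightedAccuracy idea above might be made more reusable is to wrap it as a scikit-learn scorer with `make_scorer`, so it can be passed as `scoring=` to `cross_val_score` or `RFECV`. This is only a sketch under assumptions: the function name `weighted_accuracy`, its `sensitivity_weight` parameter, and the binary iris target are all illustrative and not from the answer.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import cross_val_score

def weighted_accuracy(y_true, y_pred, sensitivity_weight=0.9):
    """Weighted mix of sensitivity and specificity for a 0/1 target."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # guard against a class being absent from y_true
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity_weight * sensitivity + (1.0 - sensitivity_weight) * specificity

# make_scorer turns a metric(y_true, y_pred) into a scorer(estimator, X, y)
weighted_accuracy_scorer = make_scorer(weighted_accuracy)

# Illustrative binary target (assumption): iris class 2 versus the rest.
X, y3 = load_iris(return_X_y=True)
y = (y3 == 2).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(rf, X, y, cv=5, scoring=weighted_accuracy_scorer)
print(scores.mean())
```

The 0.9/0.1 weighting mirrors the answer's choice for an unbalanced dataset; a different weight can be set with `make_scorer(weighted_accuracy, sensitivity_weight=...)`.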
  • 2020-12-28 18:26

    This is my code; I've tidied it up a bit to make it relevant to your task:

    import numpy as np
    from pandas import DataFrame, concat
    from sklearn import metrics
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, log_loss

    # fea_cols, train, test, status_labels, priors, private_priors and the helper
    # module kf come from the larger original script and are not shown here
    features_to_use = fea_cols  # this is a list of features
    # empty dataframe
    trim_5_df = DataFrame(columns=features_to_use)
    run = 1
    # this will remove the 5 worst features determined by their feature importance computed by the RF classifier
    while len(features_to_use) > 6:
        print('number of features:%d' % (len(features_to_use)))
        # build the classifier
        clf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
        # train the classifier
        clf.fit(train[features_to_use], train['OpenStatusMod'].values)
        print('classifier score: %f\n' % clf.score(train[features_to_use], train['OpenStatusMod'].values))
        # predict the class and print the classification report, f1 micro, f1 macro score
        pred = clf.predict(test[features_to_use])
        print(classification_report(test['OpenStatusMod'].values, pred, target_names=status_labels))
        print('micro score: ')
        print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='micro'))
        print('macro score:\n')
        print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='macro'))
        # predict the class probabilities
        probs = clf.predict_proba(test[features_to_use])
        # rescale the priors
        new_probs = kf.cap_and_update_priors(priors, probs, private_priors, 0.001)
        # calculate logloss with the rescaled probabilities
        print('log loss: %f\n' % log_loss(test['OpenStatusMod'].values, new_probs))
        row = {}
        if hasattr(clf, "feature_importances_"):
            # sort the features by importance
            sorted_idx = np.argsort(clf.feature_importances_)
            # reverse the order so it is descending
            sorted_idx = sorted_idx[::-1]
            # add to dataframe
            row['num_features'] = len(features_to_use)
            row['features_used'] = ','.join(features_to_use)
            # trim the worst 5
            sorted_idx = sorted_idx[: -5]
            # swap the features list with the trimmed features
            temp = features_to_use
            features_to_use = []
            for feat in sorted_idx:
                features_to_use.append(temp[feat])
            # add the logloss performance
            row['logloss'] = [log_loss(test['OpenStatusMod'].values, new_probs)]
        # add the row to the dataframe (DataFrame.append was removed in pandas 2.0)
        trim_5_df = concat([trim_5_df, DataFrame(row)])
        run += 1

    What I'm doing here is starting with a list of features that I train on and then predict against. Using the feature importances, I then trim the worst 5 and repeat. During each run I add a row recording the prediction performance so that I can do some analysis later.
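
Since the snippet above depends on data and helpers that are not shown, here is a self-contained sketch of the same trim-and-repeat loop. Everything in it is an assumption made for illustration: the synthetic dataset from `make_classification`, the hypothetical function name `trim_by_importance`, and the plain log loss in place of the prior-rescaled one. Unlike the original loop, it stops before the feature count would drop below `min_features`, so every recorded feature set has been evaluated.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the author's data (assumption): 20 features, 5 informative.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)
X = pd.DataFrame(X, columns=["f%d" % i for i in range(X.shape[1])])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

def trim_by_importance(features, step=5, min_features=6):
    """Repeatedly drop the `step` least important features, logging log loss per round."""
    features = list(features)
    rows = []
    while True:
        clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
        clf.fit(X_train[features], y_train)
        loss = log_loss(y_test, clf.predict_proba(X_test[features]))
        rows.append({"num_features": len(features),
                     "features_used": ",".join(features),
                     "logloss": loss})
        if len(features) - step < min_features:
            break
        # argsort is ascending, so the first `step` positions are the least important
        worst = set(np.argsort(clf.feature_importances_)[:step].tolist())
        features = [f for i, f in enumerate(features) if i not in worst]
    return pd.DataFrame(rows)

history = trim_by_importance(X.columns, step=5)
print(history[["num_features", "logloss"]])
```

The returned frame plays the role of `trim_5_df`: one row per round, which can then be sorted by `logloss` to pick a feature set.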

    The original code was much bigger, since I had different classifiers and datasets I was analysing, but I hope you get the picture from the above. What I noticed was that, for random forest, the number of features I removed on each run affected the performance, so trimming by 1, 3 and 5 features at a time resulted in different sets of best features.

    I found that using a GradientBoostingClassifier was more predictable and repeatable, in the sense that the final set of best features agreed whether I trimmed 1, 3 or 5 features at a time.
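
The step-size comparison described above can be reproduced in outline with scikit-learn's `RFE`, whose `step` parameter controls how many features are dropped per round. This is a sketch on synthetic data (an assumption); the helper name `selected_sets` is hypothetical, and whether the sets agree will depend on the data and random seeds, so no particular outcome is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data (assumption): 15 features, 4 informative.
X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)

def selected_sets(estimator, steps=(1, 3, 5), n_keep=5):
    """Return {step: frozenset of kept feature indices} for each elimination step size."""
    result = {}
    for step in steps:
        rfe = RFE(estimator, n_features_to_select=n_keep, step=step).fit(X, y)
        result[step] = frozenset(rfe.get_support(indices=True).tolist())
    return result

estimators = {
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
}
results = {name: selected_sets(est) for name, est in estimators.items()}

for name, sets in results.items():
    # the sets agree when all step sizes end at the same feature indices
    agree = len(set(sets.values())) == 1
    print(name, "same features for every step size:", agree)
```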

    I hope I'm not teaching you to suck eggs here, since you probably know more than me, but my approach to ablative analysis was to use a fast classifier to get a rough idea of the best sets of features, then switch to a better-performing classifier, and then start hyperparameter tuning, again doing coarse-grained comparisons first and fine-grained ones once I had a feel for what the best params were.

  • 2020-12-28 18:26

    I submitted a request to add coef_ so RandomForestClassifier may be used with RFECV. However, the change had already been made. This change will be in version 0.17.

    You can pull the latest dev build if you want to use it now.
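
As a rough illustration, and assuming a scikit-learn release that includes the change described above, `RFECV` can then be given a plain `RandomForestClassifier` without any wrapper class. The binary iris target and the parameter values below are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y3 = load_iris(return_X_y=True)
y = (y3 == 2).astype(int)   # binary target so that 'roc_auc' scoring applies

rf = RandomForestClassifier(n_estimators=100, random_state=0)
# RFECV ranks features from the forest's feature_importances_ at each round
rfecv = RFECV(estimator=rf, step=1, cv=5, scoring="roc_auc")
rfecv.fit(X, y)

print("features kept:", rfecv.n_features_)
print("selection mask:", rfecv.support_)
print("ranking (1 = kept):", rfecv.ranking_)
```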

  • 2020-12-28 18:44

    Here's what I've done to adapt RandomForestClassifier to work with RFECV:

    class RandomForestClassifierWithCoef(RandomForestClassifier):
        def fit(self, *args, **kwargs):
            super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
            # expose the importances under the attribute name RFECV looks for
            self.coef_ = self.feature_importances_
            return self

    Using this class is enough if you score with 'accuracy' or 'f1'. With 'roc_auc', RFECV complains that the multiclass format is not supported. After converting the target to a two-class problem with the line below, 'roc_auc' scoring works. (Using Python 3.4.1 and scikit-learn 0.15.1)

    y=(pd.Series(iris.target, name='target')==2).astype(int)

    Plugging into your code:

    from sklearn import datasets
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV

    class RandomForestClassifierWithCoef(RandomForestClassifier):
        def fit(self, *args, **kwargs):
            super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
            # expose the importances under the attribute name RFECV looks for
            self.coef_ = self.feature_importances_
            return self

    iris = datasets.load_iris()
    x = pd.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
    # two-class target (class 2 versus the rest) so that 'roc_auc' scoring is supported
    y = (pd.Series(iris.target, name='target') == 2).astype(int)
    rf = RandomForestClassifierWithCoef(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
    rfecv = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc', verbose=2)
    rfecv.fit(x, y)