Recursive feature elimination on Random Forest using scikit-learn

一向 2020-12-28 18:06

I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created

4 Answers
  • 2020-12-28 18:21

    Here's what I ginned up. It's a pretty simple solution, and it relies on a custom accuracy metric (called weightedAccuracy) since I'm classifying a highly unbalanced dataset. It should be easy to extend if desired.

    import pandas
    from sklearn import datasets
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split


    def get_enhanced_confusion_matrix(actuals, predictions, labels):
        """Enhance confusion_matrix by adding sensitivity and specificity metrics."""
        cm = confusion_matrix(actuals, predictions, labels=labels)
        sensitivity = float(cm[1][1]) / float(cm[1][0] + cm[1][1])
        specificity = float(cm[0][0]) / float(cm[0][0] + cm[0][1])
        # sensitivity is weighted heavily because the dataset is unbalanced
        weightedAccuracy = (sensitivity * 0.9) + (specificity * 0.1)
        return cm, sensitivity, specificity, weightedAccuracy

    iris = datasets.load_iris()
    x = pandas.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
    y = pandas.Series(iris.target, name='target')

    response, _ = pandas.factorize(y)

    xTrain, xTest, yTrain, yTest = train_test_split(x, response, test_size=.25, random_state=36583)
    print("building the first forest")
    rf = RandomForestClassifier(n_estimators=500, min_samples_split=2, n_jobs=-1, verbose=1)
    rf.fit(xTrain, yTrain)
    # rank the features once, using the forest trained on all of them
    importances = pandas.DataFrame({'name': x.columns, 'imp': rf.feature_importances_
                                    }).sort_values('imp', ascending=False).reset_index(drop=True)

    # note: iris has three classes; labels [0, 1] scores only classes 0 and 1
    cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0, 1])
    numFeatures = len(x.columns)

    rfeMatrix = pandas.DataFrame({'numFeatures': [numFeatures],
                                  'weightedAccuracy': [weightedAccuracy],
                                  'sensitivity': [sensitivity],
                                  'specificity': [specificity]})

    print("running RFE on %d features" % numFeatures)

    # refit on the top-i features for each i below the full feature count
    for i in range(1, numFeatures, 1):
        varsUsed = importances['name'][0:i].tolist()
        print("now using %d of %s features" % (len(varsUsed), numFeatures))
        xTrain, xTest, yTrain, yTest = train_test_split(x[varsUsed], response, test_size=.25)
        rf = RandomForestClassifier(n_estimators=500, min_samples_split=2,
                                    n_jobs=-1, verbose=1)
        rf.fit(xTrain, yTrain)
        cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0, 1])
        print("\n" + str(cm))
        print('the sensitivity is %d percent' % (sensitivity * 100))
        print('the specificity is %d percent' % (specificity * 100))
        print('the weighted accuracy is %d percent' % (weightedAccuracy * 100))
        # DataFrame.append was removed in pandas 2.0; concat does the same job
        rfeMatrix = pandas.concat(
            [rfeMatrix,
             pandas.DataFrame({'numFeatures': [len(varsUsed)],
                               'weightedAccuracy': [weightedAccuracy],
                               'sensitivity': [sensitivity],
                               'specificity': [specificity]})], ignore_index=True)
    print("\n" + str(rfeMatrix))
    # pick the smallest feature count that reaches the best weighted accuracy
    maxAccuracy = rfeMatrix.weightedAccuracy.max()
    maxAccuracyFeatures = min(rfeMatrix.numFeatures[rfeMatrix.weightedAccuracy == maxAccuracy])
    featuresUsed = importances['name'][0:maxAccuracyFeatures].tolist()

    print("the final features used are %s" % featuresUsed)
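
If you want to reuse the weightedAccuracy idea outside this script, one option is to wrap it as a scorer. The following is only a sketch under assumptions: the function name `weighted_accuracy`, the `sensitivity_weight` parameter and the toy labels are made up for illustration, and it assumes a binary 0/1 target. `make_scorer` is the standard scikit-learn helper for turning a metric function into something usable as `scoring=`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer


def weighted_accuracy(y_true, y_pred, sensitivity_weight=0.9):
    """Blend sensitivity and specificity for a binary 0/1 target (illustrative name)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # guard against an empty class so the metric never divides by zero
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity_weight * sensitivity + (1 - sensitivity_weight) * specificity


# a scorer object that scikit-learn tools accept through their `scoring` argument
weighted_accuracy_scorer = make_scorer(weighted_accuracy)

# toy labels, invented purely to exercise the function
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])
score = weighted_accuracy(y_true, y_pred)
print("weighted accuracy: %.3f" % score)
```

The 0.9 / 0.1 split mirrors the weights used in the code above; a different class balance would probably call for different weights.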
    
  • 2020-12-28 18:26

    This is my code; I've tidied it up a bit to make it relevant to your task:

    import numpy as np
    from pandas import DataFrame, concat
    from sklearn import metrics
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, log_loss

    # fea_cols, train, test, status_labels, priors, private_priors and kf come
    # from the original project and are not defined in this snippet
    features_to_use = fea_cols  # this is a list of features
    # empty dataframe
    trim_5_df = DataFrame(columns=features_to_use)
    run = 1
    # this will remove the 5 worst features determined by their feature importance computed by the RF classifier
    while len(features_to_use) > 6:
        print('number of features:%d' % (len(features_to_use)))
        # build the classifier
        clf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
        # train the classifier
        clf.fit(train[features_to_use], train['OpenStatusMod'].values)
        # score on the training data (labels must come from the same frame as the features)
        print('classifier score: %f\n' % clf.score(train[features_to_use], train['OpenStatusMod'].values))
        # predict the class and print the classification report, f1 micro, f1 macro score
        pred = clf.predict(test[features_to_use])
        print(classification_report(test['OpenStatusMod'].values, pred, target_names=status_labels))
        print('micro score: ')
        print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='micro'))
        print('macro score:\n')
        print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='macro'))
        # predict the class probabilities
        probs = clf.predict_proba(test[features_to_use])
        # rescale the priors
        new_probs = kf.cap_and_update_priors(priors, probs, private_priors, 0.001)
        # calculate logloss with the rescaled probabilities
        print('log loss: %f\n' % log_loss(test['OpenStatusMod'].values, new_probs))
        row = {}
        if hasattr(clf, "feature_importances_"):
            # sort the features by importance
            sorted_idx = np.argsort(clf.feature_importances_)
            # reverse the order so it is descending
            sorted_idx = sorted_idx[::-1]
            # add to dataframe
            row['num_features'] = len(features_to_use)
            row['features_used'] = ','.join(features_to_use)
            # trim the worst 5
            sorted_idx = sorted_idx[:-5]
            # swap the features list with the trimmed features
            temp = features_to_use
            features_to_use = []
            for feat in sorted_idx:
                features_to_use.append(temp[feat])
            # add the logloss performance
            row['logloss'] = [log_loss(test['OpenStatusMod'].values, new_probs)]
        print('')
        # add the row to the dataframe (DataFrame.append was removed in pandas 2.0)
        trim_5_df = concat([trim_5_df, DataFrame(row)])
        run += 1
    

    So what I'm doing here is this: I have a list of features I want to train on and then predict against. Using the feature importances, I trim the worst 5 and repeat. During each run I add a row recording the prediction performance so that I can do some analysis later.
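
For readers without the original data, the trim-and-repeat loop described here can be sketched on synthetic data. This is a rough illustration, not the original code: the function name `trim_by_importance`, its parameters, the synthetic dataset and the use of plain accuracy instead of log loss are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def trim_by_importance(X, y, feature_names, step=5, min_features=6, random_state=0):
    """Repeatedly fit a forest, record its test score, and drop the `step` least
    important features while more than `min_features` remain (hypothetical helper).
    Assumes step < min_features so a trim never empties the feature list."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state, stratify=y)
    active = list(range(X.shape[1]))  # column indices still in play
    history = []
    while len(active) > min_features:
        clf = RandomForestClassifier(n_estimators=50, random_state=random_state)
        clf.fit(X_train[:, active], y_train)
        history.append({
            "num_features": len(active),
            "features_used": [feature_names[i] for i in active],
            "accuracy": clf.score(X_test[:, active], y_test),
        })
        # sort descending by importance, then cut off the worst `step` features
        order = np.argsort(clf.feature_importances_)[::-1]
        active = [active[i] for i in order[:-step]]
    return history


# synthetic stand-in for the real dataset
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)
names = ["f%d" % i for i in range(X.shape[1])]
history = trim_by_importance(X, y, names)
for row in history:
    print(row["num_features"], round(row["accuracy"], 3))
```

Changing `step` to 1 or 3 is a cheap way to check whether the surviving feature set depends on how many features are trimmed per run.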

    The original code was much bigger, since I was analysing different classifiers and datasets, but I hope you get the picture from the above. One thing I noticed was that for random forest, the number of features removed on each run affected the performance, so trimming 1, 3 or 5 features at a time resulted in a different set of best features.

    I found that using a GradientBoostingClassifier was more predictable and repeatable, in the sense that the final set of best features agreed whether I trimmed 1 feature at a time or 3 or 5.

    I hope I'm not teaching you to suck eggs here; you probably know more than I do. My approach to ablative analysis was to use a fast classifier to get a rough idea of the best feature sets, then switch to a better-performing classifier, then start hyperparameter tuning, again making coarse-grained comparisons first and fine-grained ones once I had a feel for the best parameters.

  • 2020-12-28 18:26

    I submitted a request to add coef_ so RandomForestClassifier may be used with RFECV. However, the change had already been made. This change will be in version 0.17.

    https://github.com/scikit-learn/scikit-learn/issues/4945

    You can pull the latest dev build if you want to use it now.
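
Assuming a scikit-learn release that includes this change (0.17 or later, per the note above), RFECV should accept a plain RandomForestClassifier with no wrapper class. A minimal sketch on the iris data, reduced to a two-class problem so that 'roc_auc' scoring applies; the hyperparameters are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

iris = load_iris()
X = iris.data
# binary target: class 2 versus the rest
y = (iris.target == 2).astype(int)

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)
# RFECV ranks features by the forest's feature_importances_ at each step
rfecv = RFECV(estimator=rf, step=1, cv=2, scoring="roc_auc")
rfecv.fit(X, y)

print("features kept:", rfecv.n_features_)
print("selection mask:", rfecv.support_)
print("ranking:", rfecv.ranking_)
```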

  • 2020-12-28 18:44

    Here's what I've done to adapt RandomForestClassifier to work with RFECV:

    class RandomForestClassifierWithCoef(RandomForestClassifier):
        def fit(self, *args, **kwargs):
            super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
            # expose the importances under the attribute name RFECV looks for
            self.coef_ = self.feature_importances_
            return self
    

    Just using this class does the trick if you use 'accuracy' or 'f1' score. For 'roc_auc', RFECV complains that multiclass format is not supported. Changing it to two-class classification with the code below, the 'roc_auc' scoring works. (Using Python 3.4.1 and scikit-learn 0.15.1)

    y=(pd.Series(iris.target, name='target')==2).astype(int)
    

    Plugging into your code:

    from sklearn import datasets
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFECV

    class RandomForestClassifierWithCoef(RandomForestClassifier):
        def fit(self, *args, **kwargs):
            super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
            # expose the importances under the attribute name RFECV looks for
            self.coef_ = self.feature_importances_
            return self

    iris = datasets.load_iris()
    x = pd.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
    # two-class target (class 2 vs. the rest) so that 'roc_auc' scoring works
    y = (pd.Series(iris.target, name='target') == 2).astype(int)
    rf = RandomForestClassifierWithCoef(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
    rfecv = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc', verbose=2)
    selector = rfecv.fit(x, y)
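
RFECV scores each subset by cross-validation rather than by the out-of-bag (OOB) ROC the question asks about. If OOB ROC is really what you want, one possible approach is a hand-rolled elimination loop. The sketch below is an assumption-laden illustration, not something RFECV does: the function name `oob_auc_elimination` and its parameters are made up, and it assumes a binary target and enough trees that every sample gets an OOB prediction.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def oob_auc_elimination(X, y, n_estimators=200, random_state=0):
    """Drop the least important feature one at a time, scoring each subset
    by ROC AUC computed on out-of-bag predictions (hypothetical helper)."""
    active = list(range(X.shape[1]))  # column indices still in play
    results = []
    while active:
        rf = RandomForestClassifier(n_estimators=n_estimators, oob_score=True,
                                    random_state=random_state)
        rf.fit(X[:, active], y)
        # column 1 holds the OOB probability estimate for the positive class
        auc = roc_auc_score(y, rf.oob_decision_function_[:, 1])
        results.append((list(active), auc))
        # remove the feature the forest found least useful
        weakest = int(np.argmin(rf.feature_importances_))
        active.pop(weakest)
    return results


iris = load_iris()
X = iris.data
y = (iris.target == 2).astype(int)  # binary target, as in the code above

results = oob_auc_elimination(X, y)
for features, auc in results:
    print(len(features), features, round(auc, 4))
```

Because OOB predictions come for free from the bootstrap, this avoids a separate cross-validation loop, at the cost of refitting one forest per feature removed.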
    