AUC-based Feature Importance Using Random Forest

野的像风 2021-02-09 15:36

I'm trying to predict a binary variable with both random forests and logistic regression. My classes are heavily unbalanced (approximately 1.5% of observations have Y=1).

The default feature

2 Answers
  • 2021-02-09 16:05

    After doing some research, this is what I came up with:

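    import numpy as np
    from collections import defaultdict
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    
    # Feature names: every column of db_train except the first
    names = db_train.iloc[:, 1:].columns.tolist()
    
    # -- Grid-searched parameters
    model_rf = RandomForestClassifier(n_estimators=500,
                                      class_weight="balanced",  # called "auto" in older scikit-learn
                                      criterion='gini',
                                      bootstrap=True,
                                      max_features=10,
                                      min_samples_split=2,  # current scikit-learn rejects 1
                                      min_samples_leaf=6,
                                      max_depth=3,
                                      n_jobs=-1)
    scores = defaultdict(list)
    
    # -- Fit the model (could be cross-validated)
    # X_train and X_test are assumed to be NumPy arrays; Y_train and Y_test hold the 0/1 labels
    rf = model_rf.fit(X_train, Y_train)
    # Score with the predicted probability of class 1 so that AUC reflects the ranking
    base_auc = roc_auc_score(Y_test, rf.predict_proba(X_test)[:, 1])
    
    for i in range(X_train.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])  # permute column i in place
        shuff_auc = roc_auc_score(Y_test, rf.predict_proba(X_t)[:, 1])
        # Relative drop in AUC once feature i has been shuffled
        scores[names[i]].append((base_auc - shuff_auc) / base_auc)
    
    print("Features sorted by their score:")
    print(sorted([(round(np.mean(score), 4), feat) for
                  feat, score in scores.items()], reverse=True))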
    
    Features sorted by their score:
    [(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]
    

    The output isn't pretty, but you get the idea. The weakness of this approach is that the feature importance seems to depend heavily on the model's hyperparameters. I ran it with different parameters (max_depth, max_features, ...) and got very different results. So I decided to run a grid search over the parameters (scoring='roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.
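    A single shuffle per feature is itself noisy, which may add to the instability described above. As a hedged sketch, scikit-learn's permutation_importance can repeat the shuffle several times and report the mean and spread of the AUC drop. The synthetic data, the feature_<i> labels, and the hyperparameter values below are illustrative assumptions, not the original dataset or tuned model. Note that it reports the absolute AUC drop rather than the relative drop used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real data (an assumption).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           n_redundant=0, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Illustrative hyperparameters, not the grid-searched ones.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            min_samples_leaf=6, max_depth=3, random_state=0)
rf.fit(X_train, y_train)

# Shuffle each column n_repeats times and record the drop in test AUC.
result = permutation_importance(rf, X_test, y_test, scoring="roc_auc",
                                n_repeats=20, random_state=0)

# Rank features by mean AUC drop; the std shows how noisy each estimate is.
order = np.argsort(result.importances_mean)[::-1]
for idx in order:
    print(f"feature_{idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```

    If the standard deviation of a feature is of the same order as its mean, its position in the ranking should probably not be trusted.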

    I took my inspiration from this (great) notebook.

    All suggestions and comments are most welcome!

  • 2021-02-09 16:24

    scoring is only a performance evaluation tool applied to held-out data; it does not enter the internal DecisionTreeClassifier algorithm at each split node. For the tree algorithm itself you can only specify the criterion (a kind of internal loss function evaluated at each split node), which is either Gini impurity or information entropy.

    scoring can be used in a cross-validation context where the goal is to tune some hyperparameters (like max_depth). In your case, you can use GridSearchCV with scoring='roc_auc' to tune some of your hyperparameters.
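    The distinction can be sketched as follows: criterion sits in the parameter grid because it changes how each tree is grown, while scoring only decides how every fitted candidate is judged on the held-out folds. The synthetic data and the grid values here are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the real data (an assumption).
X, y = make_classification(n_samples=1500, n_features=8, n_informative=3,
                           n_redundant=0, weights=[0.95], random_state=0)

# Model hyperparameters: these change how the trees are built.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 6],
    "max_features": [2, 4],
}

# scoring="roc_auc" is applied only to held-out folds to compare candidates.
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, class_weight="balanced",
                           random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X, y)

print(search.best_params_)            # best combination by cross-validated AUC
print(round(search.best_score_, 4))   # mean cross-validated AUC of that combination
best_rf = search.best_estimator_      # refit on all of X, y with the best parameters
```

    Stratified folds are used so that each fold keeps roughly the same share of the rare positive class.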
