AUC-based Feature Importance using Random Forest

野的像风 2021-02-09 15:36

I'm trying to predict a binary variable with both random forests and logistic regression. I have heavily unbalanced classes (approximately 1.5% of observations have Y=1).

The default feature

2 answers
  •  别跟我提以往
    2021-02-09 16:05

    After doing some research, this is what I came up with:

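    from collections import defaultdict

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    # Feature names; assumes the first column of db_train is not a feature
    names = db_train.iloc[:, 1:].columns.tolist()

    # -- Grid-searched parameters
    model_rf = RandomForestClassifier(n_estimators=500,
                                      class_weight="balanced",  # reweight the rare class
                                      criterion='gini',
                                      bootstrap=True,
                                      max_features=10,
                                      min_samples_split=2,  # smallest value scikit-learn accepts
                                      min_samples_leaf=6,
                                      max_depth=3,
                                      n_jobs=-1)
    scores = defaultdict(list)

    # -- Fit the model (could be cross-validated)
    rf = model_rf.fit(X_train, Y_train)
    # AUC is computed from the predicted P(Y=1), not from hard class labels
    auc = roc_auc_score(Y_test, rf.predict_proba(X_test)[:, 1])

    # X_test is assumed to be a NumPy array
    for i in range(X_train.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])  # permute column i in place
        shuff_auc = roc_auc_score(Y_test, rf.predict_proba(X_t)[:, 1])
        # Relative drop in AUC when feature i is shuffled
        scores[names[i]].append((auc - shuff_auc) / auc)

    print("Features sorted by their score:")
    print(sorted([(round(np.mean(score), 4), feat) for
                  feat, score in scores.items()], reverse=True))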
    
    Features sorted by their score:
    [(0.0028999999999999998, 'Var1'), (0.0027000000000000001, 'Var2'), (0.0023999999999999998, 'Var3'), (0.0022000000000000001, 'Var4'), (0.0022000000000000001, 'Var5'), (0.0022000000000000001, 'Var6'), (0.002, 'Var7'), (0.002, 'Var8'), ...]
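
The "could be cross-validated" comment in the code can be taken literally. Here is a minimal sketch of that variant, under a few assumptions of mine: the data is a synthetic stand-in from `make_classification` (not the original dataset), the function name `cv_auc_importance` and its parameters are hypothetical, and I use `StratifiedShuffleSplit` so that every test split keeps some of the rare positive class. It refits the forest on each split and reports the mean and standard deviation of the relative AUC drop per feature.

```python
from collections import defaultdict

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit


def cv_auc_importance(X, y, names, n_splits=5, seed=0):
    """Mean and std of the relative AUC drop per feature over several splits."""
    rng = np.random.default_rng(seed)
    scores = defaultdict(list)
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.3,
                                      random_state=seed)
    for train_idx, test_idx in splitter.split(X, y):
        rf = RandomForestClassifier(n_estimators=50, class_weight="balanced",
                                    min_samples_leaf=6, max_depth=3,
                                    random_state=seed)
        rf.fit(X[train_idx], y[train_idx])
        X_te, y_te = X[test_idx], y[test_idx]
        auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
        for i, name in enumerate(names):
            X_t = X_te.copy()
            X_t[:, i] = rng.permutation(X_t[:, i])  # shuffle one column only
            shuff_auc = roc_auc_score(y_te, rf.predict_proba(X_t)[:, 1])
            scores[name].append((auc - shuff_auc) / auc)
    return {name: (float(np.mean(v)), float(np.std(v)))
            for name, v in scores.items()}


# Synthetic, imbalanced stand-in data (hypothetical; not the original dataset)
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           n_redundant=0, weights=[0.95], random_state=0)
names = ["Var%d" % (i + 1) for i in range(X.shape[1])]
importance = cv_auc_importance(X, y, names)
for name, (mean, std) in sorted(importance.items(), key=lambda kv: -kv[1][0]):
    print("%s: %.4f +/- %.4f" % (name, mean, std))
```

The standard deviation across splits gives a rough idea of whether two features with close scores can really be ranked against each other.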
    

    The output isn't pretty, but you get the idea. The weakness of this approach is that the feature importance seems to depend heavily on the model parameters. I ran it with different parameters (max_depth, max_features, ...) and got very different results. So I decided to run a grid search over the parameters (scoring = 'roc_auc') and then apply this VIM (Variable Importance Measure) to the best model.
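
The grid-search-then-VIM workflow could look roughly like the sketch below. The assumptions are mine: the data is synthetic, the parameter grid is a small illustrative one (the actual grid is not given), and I swap the hand-written shuffle loop for scikit-learn's `permutation_importance`, which reports the absolute AUC drop rather than the relative one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in data (hypothetical; not the original dataset)
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           n_redundant=0, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Small illustrative grid, selected by cross-validated AUC
param_grid = {"max_depth": [3, 5], "max_features": [2, 4]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, class_weight="balanced",
                           random_state=0),
    param_grid, scoring="roc_auc", cv=3)
search.fit(X_train, y_train)

# Permutation importance of the refitted best model, measured as AUC drop
result = permutation_importance(search.best_estimator_, X_test, y_test,
                                scoring="roc_auc", n_repeats=10,
                                random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print("feature %d: %.4f +/- %.4f"
          % (i, result.importances_mean[i], result.importances_std[i]))
```

Computing the importance on held-out data keeps the ranking from rewarding features the forest merely overfit on.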

    I took my inspiration from this (great) notebook.

    All suggestions/comments are most welcome!
