Recursive feature selection may not yield higher performance?


Question


I'm trying to analyze the data below. I first modeled it with logistic regression, made predictions, and computed the accuracy and AUC; then I performed recursive feature elimination (RFE) and computed the accuracy and AUC again. I expected both to improve, but the AUC is actually lower after the feature selection (and the accuracy barely changes). Is this expected, or did I miss something? Thanks!

Data: https://github.com/amandawang-dev/census-training/blob/master/census-training.csv


For logistic regression, Accuracy: 0.8111649491571692; AUC: 0.824896256487386

After recursive feature selection, Accuracy: 0.8130075752405651; AUC: 0.7997315631730443

import pandas as pd
import numpy as np
from sklearn import preprocessing, metrics
from sklearn.model_selection import train_test_split


train = pd.read_csv('census-training.csv')
train = train.replace('?', np.nan)
for column in train.columns:
    train[column].fillna(train[column].mode()[0], inplace=True)  # impute missing values with the column mode
train['Income'] = train['Income'].str.contains('>50K').astype(int)  # binary target
train['Gender'] = train['Gender'].str.contains('Male').astype(int)

obj = train.select_dtypes(include=['object'])  # all remaining 'object' (categorical) columns
le = preprocessing.LabelEncoder()
for col in obj.columns:
    train[col] = le.fit_transform(train[col])  # encode categories as integers

train_set, test_set = train_test_split(train, test_size=0.3, random_state=42)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.metrics import accuracy_score


log_rgr = LogisticRegression(random_state=0)


X_train = train_set.iloc[:, 0:9]
y_train = train_set.iloc[:, 9]  # take the label as a 1-D Series rather than a one-column DataFrame


X_test = test_set.iloc[:, 0:9]
y_test = test_set.iloc[:, 9]

log_rgr.fit(X_train, y_train)

y_pred = log_rgr.predict(X_test)

lr_acc = accuracy_score(y_test, y_pred)

probs = log_rgr.predict_proba(X_test)
preds = probs[:,1]
print(preds)
from sklearn.preprocessing import label_binarize
y = label_binarize(y_test, classes=[0, 1])  # note to self: labels must be only 0/1
fpr, tpr, threshold = metrics.roc_curve(y, preds)

roc_auc = roc_auc_score(y_test, preds)

print("Accuracy: {}".format(lr_acc))
print("AUC: {}".format(roc_auc))

from sklearn.feature_selection import RFE


rfe = RFE(log_rgr, n_features_to_select=5)  # keep the 5 highest-ranked features
fit = rfe.fit(X_train, y_train)

X_train_new = fit.transform(X_train)
X_test_new = fit.transform(X_test)

log_rgr.fit(X_train_new, y_train)
y_pred = log_rgr.predict(X_test_new)

lr_acc = accuracy_score(y_test, y_pred)

probs = rfe.predict_proba(X_test)  # RFE applies its feature mask internally before calling the wrapped estimator
preds = probs[:,1]
y = label_binarize(y_test, classes=[0, 1]) 

fpr, tpr, threshold = metrics.roc_curve(y, preds)
roc_auc =roc_auc_score(y_test, preds)

print("Accuracy: {}".format(lr_acc))
print("AUC: {}".format(roc_auc))

Answer 1:


There is simply no guarantee that any kind of feature selection (backward, forward, recursive, you name it) will actually lead to better performance in general. None at all. Such tools are there for convenience only; they may work, or they may not. The best guide and ultimate judge is always the experiment.
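To make that comparison an actual experiment rather than a single train/test split, one option is to cross-validate both variants and compare like with like. The snippet below is only a minimal sketch that mirrors the question's setup (features in columns 0-8, target in column 9, 5 features kept by RFE); none of those numbers are prescriptive, and max_iter is raised purely to avoid convergence warnings.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = train.iloc[:, 0:9], train.iloc[:, 9]  # same columns as in the question

baseline = LogisticRegression(random_state=0, max_iter=1000)
with_rfe = make_pipeline(
    RFE(LogisticRegression(random_state=0, max_iter=1000), n_features_to_select=5),
    LogisticRegression(random_state=0, max_iter=1000),
)

# 5-fold cross-validated AUC for each variant; whichever is higher wins, and there
# is no a priori guarantee that it will be the RFE pipeline.
print(cross_val_score(baseline, X, y, cv=5, scoring='roc_auc').mean())
print(cross_val_score(with_rfe, X, y, cv=5, scoring='roc_auc').mean())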

Apart from some very specific cases in linear or logistic regression, most notably the Lasso (which, not coincidentally, actually comes from statistics), or somewhat extreme cases with too many features (a.k.a. the curse of dimensionality), even when feature selection works (or doesn't), there is not necessarily much to explain as to why (or why not).
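For completeness, the Lasso-style route in this setting would be L1-penalized logistic regression, which performs embedded feature selection by driving some coefficients to exactly zero. A minimal sketch, reusing X_train/y_train from the question; the C value here is an arbitrary illustration, not a recommendation:

import numpy as np
from sklearn.linear_model import LogisticRegression

# The L1 penalty needs a solver that supports it, e.g. 'liblinear' or 'saga'
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1, random_state=0)
l1_model.fit(X_train, y_train)

# Features whose coefficients were not shrunk to zero are the ones the model kept
selected = np.flatnonzero(l1_model.coef_[0])
print(X_train.columns[selected])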



Source: https://stackoverflow.com/questions/61441974/recursive-feature-selection-may-not-yield-higher-performance
