Question
I experimented with the breast cancer dataset from scikit-learn.
Using all features, without StandardScaler:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
- result 1 : 0.9473684210526315
Using all features, with StandardScaler:
from sklearn.preprocessing import StandardScaler

cancer = datasets.load_breast_cancer()
x = cancer.data
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
sc = StandardScaler()
sc.fit(x_train)
x_train = sc.transform(x_train)
x_test = sc.transform(x_test)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
- result 2 : 0.9736842105263158
Using only two features, without StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data[:, [27, 22]]
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
- result 3 : 0.37719298245614036
Using only two features, with StandardScaler:
cancer = datasets.load_breast_cancer()
x = cancer.data[:, [27, 22]]
y = cancer.target
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
sc = StandardScaler()
sc.fit(x_train)
x_train = sc.transform(x_train)
x_test = sc.transform(x_test)
pla = Perceptron().fit(x_train, y_train)
y_pred = pla.predict(x_test)
print(accuracy_score(y_test, y_pred))
- result 4 : 0.9824561403508771
As results 1 through 4 show, accuracy improves far more with StandardScaler when training on fewer features.
So I am wondering: why does StandardScaler have different effects under different numbers of features?
PS. Here are the two features I chose:
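For context (this check is not in the original post), the two selected columns sit on very different scales, which is presumably why scaling matters so much for the two-feature model. A quick sketch to inspect them:

```python
# Inspect the names and scales of the two columns picked in the question.
from sklearn import datasets

cancer = datasets.load_breast_cancer()
cols = [27, 22]
for c in cols:
    col = cancer.data[:, c]
    print(cancer.feature_names[c], col.mean(), col.std())
```

On this dataset, "worst perimeter" (index 22) is on the order of hundreds while "worst concave points" (index 27) stays below one, so without scaling the perceptron's updates are dominated by one axis.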
Answer 1:
TL;DR
Don't do feature selection unless you fully understand why you are doing it and in which way it may help your algorithm learn and generalize better. For a start, read http://www.feat.engineering/selection.html by Max Kuhn.
Full answer.
I suspect you tried to select a best feature subset and ran into a situation where an [arbitrary] subset performed better than the whole dataset. StandardScaler is beside the point here, because it is a standard preprocessing step for your algorithm. So your real question should be: "Why does a subset of features perform better than the full dataset?"
Why is your selection algorithm arbitrary? Two reasons.
First, nobody has proven that the most linearly correlated features would improve your algorithm (or any other, if you wish). Second, the best feature subset is different from the subset dictated by the best-correlated features.
Let's see this with code.
A feature subset giving best accuracy
Let's do a brute-force search.
from itertools import combinations
from tqdm import tqdm

acc_bench = 0.9736842105263158  # accuracy on all (scaled) features
res = {}
f = x_train.shape[1]
pcpt = Perceptron(n_jobs=-1)

for i in tqdm(range(2, 10)):
    features_list = combinations(range(f), i)
    for features in features_list:
        pcpt.fit(x_train[:, features], y_train)
        preds = pcpt.predict(x_test[:, features])
        acc = accuracy_score(y_test, preds)
        if acc > acc_bench:
            acc_bench = acc
            res["accuracy"] = acc_bench
            res["features"] = features
print(res)
{'accuracy': 1.0, 'features': (0, 15, 22)}
So you see that the feature subset [0, 15, 22] gives perfect accuracy on the validation dataset.
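One way to see that such a "perfect" subset is tuned to this particular train/test split (a sketch of mine, not part of the original answer) is to re-evaluate it under cross-validation, with the scaling done inside a pipeline so each fold is scaled on its own training data:

```python
# Re-evaluate the brute-force subset [0, 15, 22] with 5-fold cross-validation.
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.model_selection import cross_val_score

cancer = datasets.load_breast_cancer()
x = cancer.data[:, [0, 15, 22]]
y = cancer.target
pipe = make_pipeline(StandardScaler(), Perceptron())
scores = cross_val_score(pipe, x, y, cv=5)
print(scores.mean())
```

If the mean cross-validated accuracy falls short of 1.0, the subset was at least partly fitted to the noise of one validation split rather than being genuinely "perfect".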
Do the best features have anything to do with correlation to the target?
Let's build a list of features ordered by their degree of linear correlation with the target.
import numpy as np
import pandas as pd

features = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target = pd.DataFrame(cancer.target, columns=['target'])
cancer_data = pd.concat([features, target], axis=1)
features_list = np.argsort(np.abs(cancer_data.corr()['target'])[:-1].values)[::-1]
features_list
array([27, 22, 7, 20, 2, 23, 0, 3, 6, 26, 5, 25, 10, 12, 13, 21, 24,
28, 1, 17, 4, 8, 29, 15, 16, 19, 14, 9, 11, 18])
You see that the best feature subset found by brute force has nothing to do with the correlation ordering.
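As a quick cross-check (a sketch, not in the original answer), we can look up where the brute-force winners 0, 15, and 22 fall in that correlation ordering:

```python
# Rank of each brute-force feature within the correlation-sorted list.
import numpy as np
import pandas as pd
from sklearn import datasets

cancer = datasets.load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target
order = np.argsort(np.abs(df.corr()['target'])[:-1].values)[::-1]
ranks = {i: int(np.where(order == i)[0][0]) for i in (0, 15, 22)}
print(ranks)
```

Reading off the ordering printed above: feature 22 is in fact the second most correlated, but feature 15 sits near the bottom of the list, so ranking by correlation alone would never have produced this subset.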
Can linear correlation explain accuracy of Perceptron?
Let's plot the number of features taken from the list above (starting with the 2 most correlated) against the resulting accuracy.
import matplotlib.pyplot as plt

res = dict()
for i in tqdm(range(2, 10)):
    features = features_list[:i]
    pcpt.fit(x_train[:, features], y_train)
    preds = pcpt.predict(x_test[:, features])
    acc = accuracy_score(y_test, preds)
    res[i] = [acc]
pd.DataFrame(res).T.plot.bar()
plt.ylim([.9, 1])
Once again, the most linearly correlated features have nothing to do with perceptron accuracy.
Conclusion.
Don't select features before feeding any algorithm unless you are perfectly sure what you are doing and what the effects will be. Don't mix up different selection and learning algorithms, because different algorithms have different opinions about what is important and what is not. A feature unimportant for one algorithm may become important for another. This is especially true for linear vs. nonlinear algorithms.
If you want to improve the accuracy of your algorithm, do data cleaning or feature engineering instead.
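As a side note (mine, not part of the original answer): the fit-scaler-on-train-then-transform steps in the question are usually bundled into a scikit-learn Pipeline, which guarantees the scaler only ever sees training data. A minimal sketch reproducing the "all features, scaled" setup:

```python
# Pipeline equivalent of the question's StandardScaler + Perceptron setup.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron

cancer = datasets.load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42)
pipe = make_pipeline(StandardScaler(), Perceptron()).fit(x_train, y_train)
score = pipe.score(x_test, y_test)
print(score)
```

Beyond tidiness, a pipeline can be passed directly to cross_val_score or GridSearchCV without leaking test statistics into the scaler.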
Source: https://stackoverflow.com/questions/64449113/why-does-the-standardscaler-have-different-effects-under-different-number-of-fea