Use MinMaxScaler on training data to generate std, min and max to be used on testing data

Posted by 会有一股神秘感 on 2019-12-11 05:56:55

Question


How would I use scikit-learn's MinMaxScaler to standardize every column of a pandas DataFrame training set, but apply the exact same standard deviation and min/max formula to my test set?

Since my test data is unknown to the model, I don't want to standardize the whole data set; that would not give an accurate model for future unknown data. Instead, I would like to standardize the data between 0 and 1 using the training set, and apply the same std, min, and max numbers in the formula to the test data.

(Obviously I could write my own min-max scaler, but I'm wondering whether scikit-learn can already do this, or whether there is a library I could use for it first.)


Answer 1:


You should be able to fit it on your training data and then transform your test data:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)  # or: fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Your proposed approach is in fact good practice. If you were to call fit on your entire X matrix (train and test combined), you'd cause information leakage, since your training process would have "seen" the scale of your test data beforehand. The class-based design of MinMaxScaler() is how sklearn addresses this specifically: the fitted object "remembers" attributes of the data on which it was fit.
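As a quick illustration (with made-up toy arrays, not data from the question), the fitted scaler exposes the training min/max it remembered via its data_min_ and data_max_ attributes, and applies them to whatever you transform:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data just to show what fit() "remembers"
X_train = np.array([[0.0, 100.0], [10.0, 200.0]])
X_test = np.array([[5.0, 150.0]])

scaler = MinMaxScaler().fit(X_train)   # learns column-wise min/max from train only
print(scaler.data_min_)                # [  0. 100.]
print(scaler.data_max_)                # [ 10. 200.]
print(np.allclose(scaler.transform(X_test), [[0.5, 0.5]]))  # True
```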

However, be aware that MinMaxScaler() does not scale to ~N(0, 1). In fact, it is explicitly billed as an alternative to this scaling. In other words, it does not guarantee you unit variance or 0 mean at all. In fact, it really doesn't care about standard deviation as it's defined in the traditional sense.

From the docstring:

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max_ - min_) + min_

Where min_ and max_ are equal to your unpacked feature_range (default (0, 1)) from the __init__ of MinMaxScaler(). Manually this is:

def scale(a):
    # implicit feature_range=(0,1)
    return (a - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))
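To confirm that this manual version agrees with sklearn, here is a small self-contained sketch (the arrays are assumptions chosen only for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed toy data: 3 training rows, 2 features
X_train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
X_test = np.array([[2.0, 25.0]])

def scale(a):
    # implicit feature_range=(0, 1), using the *training* min/max
    return (a - X_train.min(axis=0)) / (X_train.max(axis=0) - X_train.min(axis=0))

scaler = MinMaxScaler().fit(X_train)
print(np.allclose(scale(X_test), scaler.transform(X_test)))  # True
```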

So say you had:

import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(444)

X = np.random.normal(loc=5, scale=2, size=(200, 3))
y = np.random.normal(loc=-5, scale=3, size=X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=444)

If you were to call

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

Know that scaler.scale_ is not the standard deviation of the data on which you did the fitting.

scaler.scale_
# array([ 0.0843,  0.0852,  0.0876])

X_train.std(axis=0)
# array([ 2.042 ,  2.0767,  2.1285])

Instead, it is:

(1 - 0) / (X_train.max(axis=0) - X_train.min(axis=0))
# array([ 0.0843,  0.0852,  0.0876])
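One consequence worth keeping in mind, sketched below with assumed toy data: because the min/max come from the training set only, test values outside the training range will map outside the feature_range. (Newer scikit-learn versions also offer a clip parameter on MinMaxScaler if you need the output bounded, though that is version-dependent.)

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Assumed data: test values fall outside the training range [1, 3]
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[0.0], [4.0]])

scaler = MinMaxScaler().fit(X_train)
print(scaler.transform(X_test))
# [[-0.5]
#  [ 1.5]]
```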


Source: https://stackoverflow.com/questions/48511048/use-minmaxscaler-on-training-data-to-generate-std-min-and-max-to-be-used-on-tes
