Question
I have the following df:
Date      Event_Counts  Category_A  Category_B
20170401  982457        0           1
20170402  982754        1           0
20170402  875786        0           1
I am preparing the data for a regression analysis and want to standardize the column Event_Counts so that it is on a scale similar to the category columns.
I use the following code:
from sklearn import preprocessing
df['scaled_event_counts'] = preprocessing.scale(df['Event_Counts'])
While I do get this warning:
DataConversionWarning: Data with input dtype int64 was converted to float64 by the scale function.
warnings.warn(msg, _DataConversionWarning)
it seems to have worked; there is a new column. However, it contains negative numbers such as -1.3.
What I thought the scale function does is subtract the mean from the number and divide it by the standard deviation for every row; then add the min of the result to every row.
Does it not work that way for pandas? Or should I use the normalize() function or StandardScaler()? I wanted the standardized column to be on a scale of 0 to 1.
Thank You
Answer 1:
I think you are looking for the sklearn.preprocessing.MinMaxScaler. That will allow you to scale to a given range.
So in your case it would be:
scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
# fit_transform expects 2D input, so select the column with double brackets
df['scaled_event_counts'] = scaler.fit_transform(df[['Event_Counts']]).ravel()
To scale the entire df:
scaled_df = scaler.fit_transform(df)
print(scaled_df)
[[ 0. 0.99722347 0. 1. ]
[ 1. 1. 1. 0. ]
[ 1. 0. 0. 1. ]]
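For reference, min-max scaling maps each value x to (x - min) / (max - min). Here is a minimal sketch that checks this by hand against MinMaxScaler, assuming the three-row frame from the question. The variable names manual and scaled are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample data copied from the question.
df = pd.DataFrame({
    "Date": [20170401, 20170402, 20170402],
    "Event_Counts": [982457, 982754, 875786],
    "Category_A": [0, 1, 0],
    "Category_B": [1, 0, 1],
})

# Work in float so no integer-to-float conversion is needed later.
counts = df[["Event_Counts"]].astype(float)

# Min-max scaling by hand: (x - min) / (max - min).
col = counts["Event_Counts"]
manual = ((col - col.min()) / (col.max() - col.min())).to_numpy()

# The scaler expects 2D input (a one-column DataFrame here);
# ravel() flattens the (n, 1) result back to 1D.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(counts).ravel()

print(manual)
print(scaled)
```

Both arrays should agree up to floating-point rounding, with the smallest count mapped to 0 and the largest to 1.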
Answer 2:
preprocessing.scale standardizes each feature (column) by subtracting its mean and dividing by its standard deviation. So,
scaled_event_counts = (Event_Counts - mean(Event_Counts)) / std(Event_Counts)
The int64 to float64 warning comes from having to subtract the mean, which would be a floating point number, and not just an integer.
The scaled column contains negative numbers because the mean is shifted to zero, so every value below the original mean becomes negative.
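The formula can be checked directly. This is a small sketch using the three Event_Counts values from the question. Note that preprocessing.scale uses the population standard deviation (ddof=0), whereas pandas' .std() defaults to ddof=1, so ddof=0 is passed explicitly. Casting to float first should also avoid the dtype conversion warning on versions that emit it.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Event_Counts values from the question, cast to float up front.
counts = pd.Series([982457, 982754, 875786], name="Event_Counts").astype(float)

# Standardization by hand: (x - mean) / std, using the population std.
manual = ((counts - counts.mean()) / counts.std(ddof=0)).to_numpy()

# The same transformation via scikit-learn.
scaled = preprocessing.scale(counts.to_numpy())

print(manual)
print(scaled)
print(scaled.mean(), scaled.std())
```

The result has a mean of roughly zero and a standard deviation of roughly one, so values below the original mean come out negative. That is expected for standardization and is why it does not produce a 0 to 1 range.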
Source: https://stackoverflow.com/questions/43458593/python-pandas-standardize-column-for-regression