Normalizing Pandas Series with condition

Question

I'm learning Python/Pandas with a DataFrame having the following structure:

import pandas as pd

df = pd.DataFrame({'key': [111, 222, 333, 444, 555, 666, 777, 888, 999],
                   'score1': [-1, 0, 2, -1, 7, 0, 15, 0, 1],
                   'score2': [2, 2, -1, 10, 0, 5, -1, 1, 0]})

print(df)

   key  score1  score2
0  111      -1       2
1  222       0       2
2  333       2      -1
3  444      -1      10
4  555       7       0
5  666       0       5
6  777      15      -1
7  888       0       1
8  999       1       0

The possible values for the score1 and score2 Series are -1 and all non-negative integers (0 and up).

My goal is to normalize both columns the following way:

If the value is equal to -1, return a missing value (NaN).
Otherwise, normalize the remaining non-negative integers to a scale between 0 and 1.

I don't want to overwrite the original Series score1 and score2. Instead, I would like to apply a function to both Series to create two new columns (say norm1 and norm2).

I read several posts here that recommend using MinMaxScaler() from sklearn's preprocessing module. I don't think this is what I need, since I need an extra condition to take care of the -1 values.

What I think I need is a specific function that I can apply to both Series. I have also familiarized myself with how normalization works, but I'm having difficulty implementing this function in Python. Any additional help would be greatly appreciated.

Answer 1:

The idea is to convert the -1 values to missing values first. MinMaxScaler ignores NaN when fitting and leaves it as NaN in the output:

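import pandas as pd
from sklearn import preprocessing

cols = ['score1', 'score2']
# Replace -1 with NaN (note: this overwrites score1 and score2 in df)
df[cols] = df[cols].mask(df[cols] == -1)

# Scale each column to [0, 1]; NaN entries stay NaN
x = df[cols].values
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
# Attach the scaled values as new columns prefixed with norm_
df = df.join(pd.DataFrame(x_scaled, columns=cols).add_prefix('norm_'))
print(df)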
   key  score1  score2  norm_score1  norm_score2
0  111     NaN     2.0          NaN          0.2
1  222     0.0     2.0     0.000000          0.2
2  333     2.0     NaN     0.133333          NaN
3  444     NaN    10.0          NaN          1.0
4  555     7.0     0.0     0.466667          0.0
5  666     0.0     5.0     0.000000          0.5
6  777    15.0     NaN     1.000000          NaN
7  888     0.0     1.0     0.000000          0.1
8  999     1.0     0.0     0.066667          0.0
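
The question asks for score1 and score2 to stay untouched and for the result to go into norm1 and norm2. A minimal pandas-only sketch of that variant follows. The helper name normalize_series and its sentinel parameter are made up for illustration, and the code assumes each column has at least two distinct valid values (otherwise the range is zero and the result is all NaN).

```python
import pandas as pd


def normalize_series(s, sentinel=-1):
    """Return a min-max scaled copy of s; sentinel values become NaN."""
    # mask() returns a new Series, so the original column is not modified
    valid = s.mask(s == sentinel)
    # min() and max() skip NaN by default
    lo, hi = valid.min(), valid.max()
    # Assumes hi > lo; NaN entries propagate through the arithmetic
    return (valid - lo) / (hi - lo)


df = pd.DataFrame({'key': [111, 222, 333, 444, 555, 666, 777, 888, 999],
                   'score1': [-1, 0, 2, -1, 7, 0, 15, 0, 1],
                   'score2': [2, 2, -1, 10, 0, 5, -1, 1, 0]})

# New columns only; score1 and score2 keep their -1 values
df['norm1'] = normalize_series(df['score1'])
df['norm2'] = normalize_series(df['score2'])
print(df)
```

The normalized values should match the norm_score1 and norm_score2 columns above, since both approaches compute (x - min) / (max - min) per column over the non-missing entries.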

Source: https://stackoverflow.com/questions/57851077/normalizing-pandas-series-with-condition

Tags

python

pandas

dataframe

normalization