Pandas: find maximum value, when and if conditions

问题

I have a dataframe, df:

id  volume  saturation  time_delay_normalised   speed   BPR_free_speed  BPR_speed   Volume  time_normalised
27WESTBOUND 580 0.351515152 57  6.54248366  17.88   15.91366177 580 1.59375
27WESTBOUND 588 0.356363636 100 5.107142857 17.88   15.86519847 588 2.041666667
27WESTBOUND 475 0.287878788 64  6.25625 17.88   16.51161331 475 0.666666667
27EASTBOUND 401 0.243030303 59  6.458064516 17.88   16.88283672 401 1.0914583333
27EASTBOUND 438 0.265454545 46  7.049295775 17.88   16.70300418 438 1.479166667
27EASTBOUND 467 0.283030303 58  6.5 17.88   16.55392848 467 0.9604166667

I wish to create a new column, free_capacity and set it as the maximum value of Volume, per ID, when time_normalised is less than or equal to 1.1

Without considering the time_normalised condition, I can do this:

df['free_capacity'] = df.groupby('id')["Volume"].transform('max')

How do I add the when time_normalised <= 1.1 condition?

EDIT

@jezrael suggested the following:

df.loc[df['time_normalised'] <= 1.1, 'free_capacity'] = df.loc[df['time_normalised'] <= 1.1].groupby('id')["Volume"].transform('max')

Which gives:

id  volume  saturation  time_delay_normalised     speed  \
27WESTBOUND     580    0.351515                     57  6.542484   
27WESTBOUND     588    0.356364                    100  5.107143   
27WESTBOUND     475    0.287879                     64  6.256250   
27EASTBOUND     401    0.243030                     59  6.458065   
27EASTBOUND     438    0.265455                     46  7.049296   
27EASTBOUND     467    0.283030                     58  6.500000   

   BPR_free_speed  BPR_speed  Volume  time_normalised  free_capacity  
          17.88  15.913662     580         1.593750            NaN  
          17.88  15.865198     588         2.041667            NaN  
          17.88  16.511613     475         0.666667          475.0  
          17.88  16.882837     401         1.091458          467.0  
          17.88  16.703004     438         1.479167            NaN  
          17.88  16.553928     467         0.960417          467.0

However, I still wish to attribute the value of free_capacity, identified by id

Thus, I tried:

df['free_capacity'] = df.loc[df['time_normalised'] <= 1.1].groupby('id')["Volume"].transform('max')

However, this still results in NaN values. The 1.1 time_normalised condition is for finding the value, not limiting its application.

The desired outcome:

id  volume  saturation  time_delay_normalised     speed  \
    27WESTBOUND     580    0.351515                     57  6.542484   
    27WESTBOUND     588    0.356364                    100  5.107143   
    27WESTBOUND     475    0.287879                     64  6.256250   
    27EASTBOUND     401    0.243030                     59  6.458065   
    27EASTBOUND     438    0.265455                     46  7.049296   
    27EASTBOUND     467    0.283030                     58  6.500000   

       BPR_free_speed  BPR_speed  Volume  time_normalised  free_capacity  
             17.88  15.913662     580         1.593750          475.0  
             17.88  15.865198     588         2.041667          475.0  
             17.88  16.511613     475         0.666667          475.0  
             17.88  16.882837     401         1.091458          467.0  
             17.88  16.703004     438         1.479167          467.0 
             17.88  16.553928     467         0.960417          467.0

回答1:

You can use where for filtering by conditions and then groupby by Series df['id'] with transform:

df['free_capacity'] = df['Volume'].where(df['time_normalised'] <= 1.1)
                                  .groupby(df['id'])
                                  .transform('max')
print df
            id  volume  saturation  time_delay_normalised     speed  \
0  27WESTBOUND     580    0.351515                     57  6.542484   
1  27WESTBOUND     588    0.356364                    100  5.107143   
2  27WESTBOUND     475    0.287879                     64  6.256250   
3  27EASTBOUND     401    0.243030                     59  6.458065   
4  27EASTBOUND     438    0.265455                     46  7.049296   
5  27EASTBOUND     467    0.283030                     58  6.500000   

   BPR_free_speed  BPR_speed  Volume  time_normalised  free_capacity  
0           17.88  15.913662     580         1.593750          475.0  
1           17.88  15.865198     588         2.041667          475.0  
2           17.88  16.511613     475         0.666667          475.0  
3           17.88  16.882837     401         1.091458          467.0  
4           17.88  16.703004     438         1.479167          467.0  
5           17.88  16.553928     467         0.960417          467.0

It is same if use where for creating new column Volume1 by your criteria:

df['Volume1'] = df['Volume'].where(df['time_normalised'] <= 1.1)
print df
            id  volume  saturation  time_delay_normalised     speed  \
0  27WESTBOUND     580    0.351515                     57  6.542484   
1  27WESTBOUND     588    0.356364                    100  5.107143   
2  27WESTBOUND     475    0.287879                     64  6.256250   
3  27EASTBOUND     401    0.243030                     59  6.458065   
4  27EASTBOUND     438    0.265455                     46  7.049296   
5  27EASTBOUND     467    0.283030                     58  6.500000   

   BPR_free_speed  BPR_speed  Volume  time_normalised  Volume1  
0           17.88  15.913662     580         1.593750      NaN  
1           17.88  15.865198     588         2.041667      NaN  
2           17.88  16.511613     475         0.666667    475.0  
3           17.88  16.882837     401         1.091458    401.0  
4           17.88  16.703004     438         1.479167      NaN  
5           17.88  16.553928     467         0.960417    467.0

Use groupby with transform with new column Volume1:

df['free_capacity'] = df.groupby('id')["Volume1"].transform('max')
print df
            id  volume  saturation  time_delay_normalised     speed  \
0  27WESTBOUND     580    0.351515                     57  6.542484   
1  27WESTBOUND     588    0.356364                    100  5.107143   
2  27WESTBOUND     475    0.287879                     64  6.256250   
3  27EASTBOUND     401    0.243030                     59  6.458065   
4  27EASTBOUND     438    0.265455                     46  7.049296   
5  27EASTBOUND     467    0.283030                     58  6.500000   

   BPR_free_speed  BPR_speed  Volume  time_normalised  Volume1  free_capacity  
0           17.88  15.913662     580         1.593750      NaN          475.0  
1           17.88  15.865198     588         2.041667      NaN          475.0  
2           17.88  16.511613     475         0.666667    475.0          475.0  
3           17.88  16.882837     401         1.091458    401.0          467.0  
4           17.88  16.703004     438         1.479167      NaN          467.0  
5           17.88  16.553928     467         0.960417    467.0          467.0

回答2:

Consider also a groupby().apply():

def maxtime(row):
    row['free_capacity'] = row[row['time_normalised'] <= 1.1]['Volume'].max()
    return row

df = df.groupby('id').apply(maxtime)

回答3:

There can be several answers, You can also do this:

df.set_index('id', inplace=True)
df['free_capacity'] = df.groupby(level=0).apply(lambda x: x.loc[x['time_normalised']<=1.1]['volume'].max())

This gives the following:

             volume  saturation  time_delay_normalised     speed  \
id
27WESTBOUND     580    0.351515                     57  6.542484
27WESTBOUND     588    0.356364                    100  5.107143
27WESTBOUND     475    0.287879                     64  6.256250
27EASTBOUND     401    0.243030                     59  6.458065
27EASTBOUND     438    0.265455                     46  7.049296
27EASTBOUND     467    0.283030                     58  6.500000

             BPR_free_speed  BPR_speed  Volume  time_normalised    wrong_x    free_capacity
id
27WESTBOUND           17.88  15.913662     580         1.593750  588  475
27WESTBOUND           17.88  15.865198     588         2.041667  588  475
27WESTBOUND           17.88  16.511613     475         0.666667  588  475
27EASTBOUND           17.88  16.882837     401         1.091458  467  467
27EASTBOUND           17.88  16.703004     438         1.479167  467  467
27EASTBOUND           17.88  16.553928     467         0.960417  467  467

You can reset the index back if you want by df.reset_index(inplace=True) The wrong_x column is the wrong result, without the condition by doing

df['wrong_x']=B.groupby(level=0)['volume'].max()

which is what you tried initially.

来源：https://stackoverflow.com/questions/36792806/pandas-find-maximum-value-when-and-if-conditions

标签

python

pandas