Pandas groupby: how to select adjacent column data after selecting a row based on data in another column in pandas groupby groups?

Question

I have a database as partially shown below. For each date, there are entries for duration (1-20 per date), with items (100s) listed for each duration. Each item has several associated data points in adjacent columns, including an identifier. For each date, I want to select the largest duration. Then, I want to find the item with a value closest to a given input value. I would like to then obtain the ID for that item to be able to follow the value of this item through its time in the database.

Index Date      Duration Item   Value  ID
0     1/1/2018     30     100      4    a
1     1/1/2018     30     200      8    b
2     1/1/2018     30     300     20    c
3     1/1/2018     60     100      9    d
4     1/1/2018     60     200     19    e
5     1/1/2018     60     300     33    f
6     1/1/2018     60     400     50    g
7     1/2/2018     31     100      3    a
8     1/2/2018     31     200      7    b
9     1/2/2018     31     300     20    c
10    1/2/2018     61     100      8    d
11    1/2/2018     61     200     17    e
12    1/2/2018     61     300     30    f

I thought the pandas groupby function would be ideal for creating the date/duration groups:

df = df.groupby('Date')['Duration'].max()   #creates the correct groups of max duration for each date

Without groupby, the data can be obtained by finding the correct row, for instance:

row = df.index[(df['Value'] - target_value).abs().argsort()[:1]]   # index label of the row closest to target_value
id = df.loc[row, 'ID']

But that doesn't work in groupby groups. I've tried to solve this via other pandas operations, but cannot figure out how to obtain the ID data after selecting the item with the correct Value. There are many questions on SO regarding extracting data in specific columns (or applying functions to data in specific columns) after pandas.groupby, but I didn't find anything on selecting data in adjacent columns. I would appreciate it if you can point me in the right direction.

Answer 1:

You could do something like the following:

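import numpy as np

target_value = 15
# Per-date maximum duration, broadcast back onto every row
df['max_duration'] = df.groupby('Date')['Duration'].transform('max')
# Keep the max-duration rows, then keep the row(s) closest to target_value per date
df.query('max_duration == Duration')\
  .assign(dist=lambda df: np.abs(df['Value'] - target_value))\
  .assign(min_dist=lambda df: df.groupby('Date')['dist'].transform('min'))\
  .query('min_dist == dist')\
  .loc[:, ['Date', 'ID']]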

Results:

        Date ID
4   1/1/2018  e
11  1/2/2018  e
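For reference, the same steps can be sketched as a self-contained script on the sample rows from the question. The helper name `closest_id_per_date` is made up for illustration and is not part of pandas. It uses `idxmin` instead of the `min_dist == dist` filter, so on a tie it keeps only the first row rather than all tied rows.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame(
    {
        "Date": ["1/1/2018"] * 7 + ["1/2/2018"] * 6,
        "Duration": [30, 30, 30, 60, 60, 60, 60, 31, 31, 31, 61, 61, 61],
        "Item": [100, 200, 300, 100, 200, 300, 400, 100, 200, 300, 100, 200, 300],
        "Value": [4, 8, 20, 9, 19, 33, 50, 3, 7, 20, 8, 17, 30],
        "ID": list("abcdefgabcdef"),
    }
)


def closest_id_per_date(frame, target_value):
    """Hypothetical helper: per date, the ID closest to target_value at the longest duration."""
    # Keep only rows whose Duration equals the per-date maximum.
    is_longest = frame["Duration"] == frame.groupby("Date")["Duration"].transform("max")
    longest = frame[is_longest]
    # Absolute distance from the target, still aligned on the original index.
    dist = (longest["Value"] - target_value).abs()
    # idxmin gives the index label of the smallest distance in each date group
    # (first occurrence only if several rows tie).
    best = dist.groupby(longest["Date"]).idxmin()
    return longest.loc[best, ["Date", "ID"]]


closest = closest_id_per_date(df, target_value=15)
print(closest)
```

With `target_value=15` this selects the rows at index 4 and 11, matching the result shown above.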

Answer 2:

Following your logic:

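tgt_value = 19
# Boolean mask: rows whose Duration is the maximum for their Date
idx = df.groupby(['Date'])['Duration'].transform('max') == df['Duration']
d = df[idx].copy()   # copy so the column assignment below does not trigger SettingWithCopyWarning
d['dist'] = (d['Value'] - tgt_value).abs()

# One closest row per date
Row_result = d.loc[d.groupby('Date')['dist'].idxmin()]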
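A note on the last step: calling `idxmin()` on the whole `dist` column returns a single index label for the entire frame, while grouping by `Date` first returns one label per date. The sketch below contrasts the two on a small subset of the question's rows, with the distances to 19 filled in by hand.

```python
import pandas as pd

# Subset of the question's max-duration rows; dist = |Value - 19| computed by hand.
d = pd.DataFrame(
    {
        "Date": ["1/1/2018", "1/1/2018", "1/2/2018", "1/2/2018"],
        "ID": ["d", "e", "d", "e"],
        "dist": [10, 0, 11, 2],
    },
    index=[3, 4, 10, 11],
)

overall = d["dist"].idxmin()                   # one label for the whole frame
per_date = d.groupby("Date")["dist"].idxmin()  # one label per date

print(overall)
print(d.loc[per_date, ["Date", "ID"]])
```

Since the question asks for an ID on each date, the grouped form is the one that fits.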

Answer 3:

I hope I'm understanding you correctly. There might be a simpler way, but here are my thoughts:

import pandas as pd

data = [['1/1/2018', 30, 100,  4, 'a'],
        ['1/1/2018', 30, 200,  8, 'b'],
        ['1/1/2018', 30, 300, 20, 'c'],
        ['1/1/2018', 60, 100,  9, 'd'],
        ['1/1/2018', 60, 200, 19, 'e'],
        ['1/1/2018', 60, 300, 33, 'f'],
        ['1/1/2018', 60, 400, 50, 'g'],
        ['1/2/2018', 31, 100,  3, 'a'],
        ['1/2/2018', 31, 200,  7, 'b'],
        ['1/2/2018', 31, 300, 20, 'c'],
        ['1/2/2018', 61, 100,  8, 'd'],
        ['1/2/2018', 61, 200, 17, 'e'],
        ['1/2/2018', 61, 300, 30, 'f']]

df = pd.DataFrame(data=data, columns=['Date', 'Duration', 'Item', 'Value', 'ID'])

df1 = df.groupby('Date', as_index=False)[['Duration']].max()
df2 = pd.merge(df,df1, how='inner')

target_value = 19
df2['diff'] = (df2.Value - target_value).abs()

result=df2.loc[df2.groupby('Date')['diff'].idxmin()]

The result dataframe contains, for each date, the row whose value is closest to your input value. If you only want the 'ID' column, then:

IDresult = result[['ID']]
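The question also mentions following the selected item's value through its time in the database. A minimal sketch of that last step, assuming the same sample data and that filtering the original frame by the chosen ID is what is wanted. The names `tracked_id` and `history` are made up for illustration, and the selection here uses a boolean mask rather than the merge above.

```python
import pandas as pd

# Sample rows copied from the question's table.
df = pd.DataFrame(
    {
        "Date": ["1/1/2018"] * 7 + ["1/2/2018"] * 6,
        "Duration": [30, 30, 30, 60, 60, 60, 60, 31, 31, 31, 61, 61, 61],
        "Item": [100, 200, 300, 100, 200, 300, 400, 100, 200, 300, 100, 200, 300],
        "Value": [4, 8, 20, 9, 19, 33, 50, 3, 7, 20, 8, 17, 30],
        "ID": list("abcdefgabcdef"),
    }
)
target_value = 19

# Rows at the longest duration of each date, with their distance to the target.
is_longest = df["Duration"] == df.groupby("Date")["Duration"].transform("max")
longest = df[is_longest].copy()
longest["diff"] = (longest["Value"] - target_value).abs()
result = longest.loc[longest.groupby("Date")["diff"].idxmin()]

# Take the ID chosen on the first date, then pull every row carrying that ID.
tracked_id = result["ID"].iloc[0]
history = df.loc[df["ID"] == tracked_id, ["Date", "Duration", "Value"]]
print(history)
```

Here `Date` is a column of strings. On real data, converting it with `pd.to_datetime` before sorting would likely be needed to get the rows in chronological order.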

Source: https://stackoverflow.com/questions/54813114/pandas-groupby-how-to-select-adjacent-column-data-after-selecting-a-row-based-o

Tags

python

pandas

dataframe

pandas-groupby