问题
I have a database as partially shown below. For each date, there are entries for duration (1-20 per date), with items (100s) listed for each duration. Each item has several associated data points in adjacent columns, including an identifier. For each date, I want to select the largest duration. Then, I want to find the item with a value closest to a given input value. I would like to then obtain the ID for that item to be able to follow the value of this item through its time in the database.
Index Date Duration Item Value ID
0 1/1/2018 30 100 4 a
1 1/1/2018 30 200 8 b
2 1/1/2018 30 300 20 c
3 1/1/2018 60 100 9 d
4 1/1/2018 60 200 19 e
5 1/1/2018 60 300 33 f
6 1/1/2018 60 400 50 g
7 1/2/2018 31 100 3 a
8 1/2/2018 31 200 7 b
9 1/2/2018 31 300 20 c
10 1/2/2018 61 100 8 d
11 1/2/2018 61 200 17 e
12 1/2/2018 61 300 30 f
I thought the pandas groupby function would be ideal for creating the date/duration groups:
df = df.groupby('Date')['Duration'].max() #creates the correct groups of max duration for each date
Without groupby, the data can be obtained by finding the correct row, for instance:
row = df['ID'].index(df['Value'] - target_value).abs().argsort()[:1]]
id = df.loc[row, 'ID']
But that doesn't work in groupby groups. I've tried to solve this via other pandas operations, but cannot figure out how to obtain the ID data after selecting the item with the correct Value. There are many questions on SO regarding extracting data in specific columns (or applying functions to data in specific columns) after pandas.groupby, but I didn't find anything on selecting data in adjacent columns. I would appreciate it if you can point me in the right direction.
回答1:
You could do something like the following:
target_value = 15
df['max_duration'] = df.groupby('Date')['Duration'].transform('max')
df.query('max_duration == Duration')\
.assign(dist=lambda df: np.abs(df['Value'] - target_value))\
.assign(min_dist=lambda df: df.groupby('Date')['dist'].transform('min'))\
.query('min_dist == dist')\
.loc[:, ['Date', 'ID']
Results:
Date ID
4 1/1/2018 e
11 1/2/2018 e
回答2:
following your logic:
idx = df.groupby(['Date'])['Duration'].transform(max) == df['Duration']
#tgt_value = 19
d = df[idx]
d['dist']=(d['Value'] - 19).abs()
Row_result = d.loc[d['dist'].idxmin()]
回答3:
i hope i'm understanding you correctly,and there might be an easier and simpler way, but here are my thoughts:
data = [['1/1/2018' , 30 , 100 , 4 , 'a'],
['1/1/2018' , 30 , 200 , 8 , 'b'],
['1/1/2018' , 30 , 300 , 20 , 'c'],
['1/1/2018' , 60 , 100 , 9 , 'd'],
['1/1/2018' , 60 , 200 ,19 , 'e'],
['1/1/2018' , 60 , 300 ,33 , 'f'],
['1/1/2018' , 60 , 400 ,50 , 'g'],
['1/2/2018' , 31 , 100 , 3 , 'a'],
['1/2/2018' , 31 , 200 , 7 , 'b'],
['1/2/2018' , 31 , 300 , 20 , 'c'],
['1/2/2018' , 61 , 100 , 8 , 'd'],
['1/2/2018' , 61 , 200 , 17 , 'e'],
['1/2/2018' , 61 , 300 , 30 , 'f']]
df = pd.DataFrame(data=data, columns=['Date','Duration','Item','Value','ID'])
df1 = df.groupby('Date', as_index=False)[['Duration']].max()
df2 = pd.merge(df,df1, how='inner')
#target_value = 19
df2['diff']=(df2.Value-target_value).abs()
result=df2.loc[df2.groupby('Date')['diff'].idxmin()]
the result dataframe contains the value that is closest to your input value. if you only want the 'ID' column then
IDresult = result[['ID']]
来源:https://stackoverflow.com/questions/54813114/pandas-groupby-how-to-select-adjacent-column-data-after-selecting-a-row-based-o