Question
I have a DataFrame with employee salary information. It has 900,000+ rows.
Sample:
+----+-------------+---------------+----------+
|    | table_num   | name          |   salary |
|----+-------------+---------------+----------|
|  0 | 001234      | John Johnson  |     1200 |
|  1 | 001234      | John Johnson  |     1000 |
|  2 | 001235      | John Johnson  |     1000 |
|  3 | 001235      | John Johnson  |     1200 |
|  4 | 001235      | John Johnson  |     1000 |
|  5 | 001235      | Steve Stevens |     1000 |
|  6 | 001236      | Steve Stevens |     1200 |
|  7 | 001236      | Steve Stevens |     1200 |
|  8 | 001236      | Steve Stevens |     1200 |
+----+-------------+---------------+----------+
dtypes:
table_num: string
name: string
salary: float
I need to add a column indicating whether the salary increased or decreased compared with the previous row.
I'm using the shift() function to compare values in consecutive rows.
The main problem is filtering and iterating over every unique employee in the whole dataset.
It takes about three and a half hours in my script.
How can I do it faster?
My script:
# keep only the unique combinations of 'table_num' and 'name',
# since the same 'table_num' can occur with different 'name' values
# and the same name sometimes appears with a different 'table_num'
names_df = df[['table_num', 'name']].drop_duplicates()
# then extract each table_num and name from names_df
for i in range(len(names_df)):  ### Bottleneck of whole script ###
    t = names_df.iloc[i, 0]
    n = names_df.iloc[i, 1]
    # use shift() and a lambda to check whether two consecutive rows differ
    diff_sal = (df[(df['table_num'] == t)
                   & (df['name'] == n)]['salary'] - df[(df['table_num'] == t)
                   & (df['name'] == n)]['salary'].shift(1)).apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
    df.loc[diff_sal.index, 'inc'] = diff_sal.values
Sample input data:
df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})
Sample output:
+----+-------------+---------------+----------+-------+
|    | table_num   | name          |   salary |   inc |
|----+-------------+---------------+----------+-------|
|  0 | 001234      | John Johnson  |     1200 |     0 |
|  1 | 001234      | John Johnson  |     1000 |    -1 |
|  2 | 001235      | John Johnson  |     1000 |     0 |
|  3 | 001235      | John Johnson  |     1200 |     1 |
|  4 | 001235      | John Johnson  |     1000 |    -1 |
|  5 | 001235      | Steve Stevens |     1000 |     0 |
|  6 | 001236      | Steve Stevens |     1200 |     0 |
|  7 | 001236      | Steve Stevens |     1200 |     0 |
|  8 | 001236      | Steve Stevens |     1200 |     0 |
+----+-------------+---------------+----------+-------+
Answer 1:
Use groupby together with diff:
df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
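For reference, here is a minimal self-contained sketch of this approach, run on the question's sample data. The helper name salary_direction is hypothetical, and the final integer cast is an addition made only so the result matches the expected output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})


def salary_direction(frame):
    """Return 1 / -1 / 0 per row for a raise / cut / no change (hypothetical helper)."""
    # Difference to the previous row of the same (table_num, name) group;
    # the first row of every group has no predecessor, so its NaN becomes 0.
    inc = frame.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
    inc[inc > 0.0] = 1.0
    inc[inc < 0.0] = -1.0
    # Assumption: integer flags are wanted, as in the question's sample output.
    return inc.astype(int)


df['inc'] = salary_direction(df)
print(df['inc'].tolist())  # [0, -1, 0, 1, -1, 0, 0, 0, 0]
```

The whole column is computed in one grouped pass, with no Python-level loop over employees.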
Answer 2:
Use DataFrameGroupBy.diff with numpy.sign, and finally cast to integers:
import numpy as np
df['new'] = np.sign(df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)).astype(int)
print(df)
  table_num           name  salary  new
0    001234   John Johnson  1200.0    0
1    001234   John Johnson  1000.0   -1
2    001235   John Johnson  1000.0    0
3    001235   John Johnson  1200.0    1
4    001235   John Johnson  1000.0   -1
5    001235  Steve Stevens  1000.0    0
6    001236  Steve Stevens  1200.0    0
7    001236  Steve Stevens  1200.0    0
8    001236  Steve Stevens  1200.0    0
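One property worth noting: the grouped diff compares each row with the previous row of the same group, so it does not depend on an employee's rows being adjacent. A small sketch with made-up, interleaved rows (the data below is illustrative and not taken from the question):

```python
import numpy as np
import pandas as pd

# Illustrative data: the rows of the two employees alternate.
df = pd.DataFrame({
    'table_num': ['001234', '001236', '001234', '001236'],
    'name': ['John Johnson', 'Steve Stevens', 'John Johnson', 'Steve Stevens'],
    'salary': [1200., 1000., 1000., 1500.],
})

# A plain row-to-row diff compares salaries of different employees.
plain = np.sign(df['salary'].diff().fillna(0)).astype(int)

# The grouped diff only looks at the previous row of the same employee.
grouped = np.sign(
    df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)
).astype(int)

print(plain.tolist())    # [0, -1, 0, 1]
print(grouped.tolist())  # [0, 0, -1, 1]
```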
Answer 3:
shift() is the way to go, but you should avoid loops as much as you can. Here we can leverage the power of groupby() and transform(). Check the pandas docs.
In your case you can do it by writing:
# group by both keys, because the same name can appear under different table_num values
df.assign(inc=lambda d: d.groupby(['table_num', 'name'])
                         .salary
                         .transform(lambda y: y - y.shift(1))
                         .apply(lambda v: 1 if v > 0 else (-1 if v < 0 else 0))
          )
yields:
  table_num           name  salary  inc
0    001234   John Johnson  1200.0    0
1    001234   John Johnson  1000.0   -1
2    001235   John Johnson  1000.0    0
3    001235   John Johnson  1200.0    1
4    001235   John Johnson  1000.0   -1
5    001235  Steve Stevens  1000.0    0
6    001236  Steve Stevens  1200.0    0
7    001236  Steve Stevens  1200.0    0
8    001236  Steve Stevens  1200.0    0
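The choice of grouping keys matters for this data. As a rough check on the question's sample, the sketch below runs the same transform() and shift() pattern twice, once grouped by name alone and once by table_num and name together. The helper name direction is hypothetical. Only the two-key grouping reproduces the expected output in the question: with name alone, row 6 is compared with row 5, which belongs to a different table_num.

```python
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})


def direction(frame, keys):
    """Hypothetical helper: sign of the change to the previous row within each group."""
    diff = frame.groupby(keys)['salary'].transform(lambda s: s - s.shift(1))
    # NaN (first row of a group) fails both comparisons and maps to 0.
    return diff.apply(lambda v: 1 if v > 0 else (-1 if v < 0 else 0))


by_name = direction(df, ['name'])
by_both = direction(df, ['table_num', 'name'])

print(by_name.tolist())  # [0, -1, 0, 1, -1, 0, 1, 0, 0]
print(by_both.tolist())  # [0, -1, 0, 1, -1, 0, 0, 0, 0]
```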
Answer 4:
I think you can search for the term "pandas vectorization" to learn how to speed up operations on a DataFrame. For your question, could you try the following:
import pandas as pd
df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})
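# helper key that combines name and table_num
df['temp'] = df['name'] + df['table_num']
# a stable sort keeps the original row order inside each group
df.sort_values('temp', kind='mergesort', inplace=True)
# difference to the previous row within each group (NaN for a group's first row)
df['diff'] = df.groupby('temp')['salary'].diff()
# x / |x| gives 1 or -1; the 0 / 0 and first-row NaN values become 0
df['diff'] = (df['diff'] / abs(df['diff'])).fillna(0)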
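One caveat with a concatenated helper key is that two different (name, table_num) pairs can produce the same string. Whether that can happen depends on the real data; with purely alphabetic names and numeric table numbers, as in the question's sample, it may never occur. The made-up rows below only sketch the effect, and show that grouping by the two columns directly avoids it:

```python
import pandas as pd

# Made-up rows: two distinct (name, table_num) pairs whose concatenation is identical.
df = pd.DataFrame({
    'table_num': ['123', '23'],
    'name': ['A', 'A1'],
    'salary': [1000., 1200.],
})

# 'A' + '123' and 'A1' + '23' both give 'A123', so the two rows share one group.
concat_key = df['name'] + df['table_num']
merged = df.groupby(concat_key)['salary'].diff().fillna(0)

# Grouping by the two columns keeps the pairs separate.
separate = df.groupby(['name', 'table_num'])['salary'].diff().fillna(0)

print(merged.tolist())    # [0.0, 200.0]
print(separate.tolist())  # [0.0, 0.0]
```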
Source: https://stackoverflow.com/questions/52072315/pandas-compare-value-with-previous-row-with-filtration-condition