Pandas compare value with previous row with filtration condition

Submitted by 匆匆过客 on 2019-12-23 21:49:28

Question


I have a DataFrame with information about employee salaries. It has more than 900,000 rows.

Sample:

+----+-------------+---------------+----------+
|    |   table_num | name          |   salary |
|----+-------------+---------------+----------|
|  0 |      001234 | John Johnson  |     1200 |
|  1 |      001234 | John Johnson  |     1000 |
|  2 |      001235 | John Johnson  |     1000 |
|  3 |      001235 | John Johnson  |     1200 |
|  4 |      001235 | John Johnson  |     1000 |
|  5 |      001235 | Steve Stevens |     1000 |
|  6 |      001236 | Steve Stevens |     1200 |
|  7 |      001236 | Steve Stevens |     1200 |
|  8 |      001236 | Steve Stevens |     1200 |
+----+-------------+---------------+----------+

dtypes:

table_num: string
name: string
salary: float

I need to add a column that indicates whether the salary increased or decreased relative to the previous row. I'm using the shift() function to compare values in consecutive rows.

The main problem is the filtering and the iteration over every unique employee in the whole dataset.

My script takes about three and a half hours.

How can I do it faster?

My script:

# Get the unique combinations of 'table_num' and 'name',
# since the same 'table_num' can belong to different names
# and the same name sometimes appears with different 'table_num' values
names_df = df[['table_num', 'name']].drop_duplicates()

# Extract each table_num and name pair by position
for i in range(len(names_df)):    ### Bottleneck of whole script ###
    t = names_df.iloc[i, 0]
    n = names_df.iloc[i, 1]

    # Use shift() and a lambda to check whether two consecutive rows differ
    sal = df[(df['table_num'] == t) & (df['name'] == n)]['salary']
    diff_sal = (sal - sal.shift(1)).apply(lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
    df.loc[diff_sal.index, 'inc'] = diff_sal.values

Sample input data:

df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'], 
                     'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'], 
                     'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})

Sample output:

+----+-------------+---------------+----------+-------+
|    |   table_num | name          |   salary |   inc |
|----+-------------+---------------+----------+-------|
|  0 |      001234 | John Johnson  |     1200 |     0 |
|  1 |      001234 | John Johnson  |     1000 |    -1 |
|  2 |      001235 | John Johnson  |     1000 |     0 |
|  3 |      001235 | John Johnson  |     1200 |     1 |
|  4 |      001235 | John Johnson  |     1000 |    -1 |
|  5 |      001235 | Steve Stevens |     1000 |     0 |
|  6 |      001236 | Steve Stevens |     1200 |     0 |
|  7 |      001236 | Steve Stevens |     1200 |     0 |
|  8 |      001236 | Steve Stevens |     1200 |     0 |
+----+-------------+---------------+----------+-------+

Answer 1:


Use groupby together with diff:

df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
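
To make the intermediate step visible, here is a minimal, self-contained sketch that runs the same approach on the sample data from the question. The variable name raw_diff is only for illustration. It shows that diff() leaves NaN on the first row of each (table_num, name) group, which is why the fillna(0.0) call is needed before mapping to -1 / 0 / 1.

```python
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

# Row-to-row change within each (table_num, name) group.
# The first row of every group has no predecessor, so diff() yields NaN there.
raw_diff = df.groupby(['table_num', 'name'])['salary'].diff()
print(raw_diff.tolist())

# Treat "no previous row" as "no change", then collapse to -1 / 0 / 1.
df['inc'] = raw_diff.fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
print(df['inc'].tolist())
```

The result keeps the original row order and index, so no sorting or re-alignment is required afterwards.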



Answer 2:


Use DataFrameGroupBy.diff with numpy.sign, then cast the result to integers:

df['new'] = np.sign(df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)).astype(int)
print(df)
   table_num           name  salary  new
0       1234   John Johnson    1200    0
1       1234   John Johnson    1000   -1
2       1235   John Johnson    1000    0
3       1235   John Johnson    1200    1
4       1235   John Johnson    1000   -1
5       1235  Steve Stevens    1000    0
6       1236  Steve Stevens    1200    0
7       1236  Steve Stevens    1200    0
8       1236  Steve Stevens    1200    0
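
A short sketch of why the order of operations matters in this one-liner, using the question's sample data (the names diff and signed are illustrative only). numpy.sign leaves NaN as NaN, and NaN cannot be stored in a plain integer column, so the missing values have to be filled before the cast.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

diff = df.groupby(['table_num', 'name'])['salary'].diff()

# np.sign maps negatives to -1, zero to 0 and positives to 1; NaN stays NaN.
signed = np.sign(diff)
print(signed.isna().sum())  # one NaN per group (first row of each group)

# Fill the NaN values first, then cast to a plain integer dtype.
df['new'] = signed.fillna(0).astype(int)
print(df)
```

With table_num stored as strings, as in the question, the leading zeros are preserved. The printed output above shows 1234 without them, presumably because table_num was numeric in the answerer's test data.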



Answer 3:


shift() is the right tool, but you should avoid loops as much as possible. Here we can leverage groupby() and transform() instead. See the pandas docs for details.

In your case you can write:

# Group by both keys so that a change of table_num starts a new comparison
df.assign(inc=lambda x: x.groupby(['table_num', 'name'])
                      .salary
                      .transform(lambda y: y - y.shift(1))
                      .apply(lambda v: 1 if v > 0 else (-1 if v < 0 else 0))
      )

yields:

    table_num   name       salary   inc
0   001234  John Johnson    1200.0  0
1   001234  John Johnson    1000.0  -1
2   001235  John Johnson    1000.0  0
3   001235  John Johnson    1200.0  1
4   001235  John Johnson    1000.0  -1
5   001235  Steve Stevens   1000.0  0
6   001236  Steve Stevens   1200.0  0
7   001236  Steve Stevens   1200.0  0
8   001236  Steve Stevens   1200.0  0
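
The choice of grouping keys matters here. Grouping by name alone compares a person's salary across different table_num values, while grouping by both columns restarts the comparison for each (table_num, name) pair, which is what the sample output in the question expects. Below is a minimal sketch of the difference on the sample data; salary_direction is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

def salary_direction(frame, keys):
    """Return -1/0/1 per row, comparing each salary with the previous row of its group."""
    step = frame.groupby(keys)['salary'].diff().fillna(0)
    return np.sign(step).astype(int)

by_name = salary_direction(df, ['name'])
by_both = salary_direction(df, ['table_num', 'name'])

# Show the rows where the two groupings disagree.
print(df[by_name != by_both])
```

On this data the two results differ only at row 6, where Steve Stevens moves from table_num 001235 to 001236.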



Answer 4:


I think you can search for the term "pandas vectorization" to learn how to speed up operations on a DataFrame. For your question, could you try the following:

import pandas as pd

df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
                     'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
                     'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})

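# Build a single grouping key from name and table_num
df['temp'] = df['name'] + df['table_num']
# Sort by that key (note: this reorders df in place)
df.sort_values('temp', inplace=True)
# Difference from the previous row within each group (NaN on a group's first row)
df['diff'] = df.groupby('temp')['salary'].diff()
# Normalize to -1.0 / 1.0; NaN rows and 0/0 both become 0 after fillna
df['diff'] = (df['diff'] / abs(df['diff'])).fillna(0)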
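
To illustrate what vectorization buys here, the sketch below checks that a per-employee loop and a single groupby call give the same result, and times both. The data is synthetic and small (an assumption for illustration, not the asker's dataset), and loop_version and vectorized_version are hypothetical names. The loop builds a boolean mask over the whole frame for every employee, so its cost grows with the number of groups times the number of rows, while the groupby version processes all groups in one call.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data (assumption): 300 employees with 6 salary rows each.
rng = np.random.default_rng(0)
n_emp, n_rows = 300, 6
df = pd.DataFrame({
    'table_num': np.repeat([f'{i:06d}' for i in range(n_emp)], n_rows),
    'name': np.repeat([f'Employee {i}' for i in range(n_emp)], n_rows),
    'salary': rng.choice([1000., 1200., 1500.], size=n_emp * n_rows),
})

def loop_version(frame):
    """Filter the full frame once per (table_num, name) pair, as in the question."""
    out = pd.Series(0, index=frame.index, dtype=int)
    pairs = frame[['table_num', 'name']].drop_duplicates()
    for t, n in pairs.itertuples(index=False):
        mask = (frame['table_num'] == t) & (frame['name'] == n)
        step = frame.loc[mask, 'salary'].diff().fillna(0)
        out.loc[step.index] = np.sign(step).astype(int)
    return out

def vectorized_version(frame):
    """Compute the same -1/0/1 values with a single groupby call."""
    step = frame.groupby(['table_num', 'name'])['salary'].diff().fillna(0)
    return np.sign(step).astype(int)

start = time.perf_counter()
slow = loop_version(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = vectorized_version(df)
vectorized_seconds = time.perf_counter() - start

print(f'loop: {loop_seconds:.3f}s, vectorized: {vectorized_seconds:.3f}s')
print('same result:', slow.tolist() == fast.tolist())
```

The exact timings depend on the machine and the pandas version, so the numbers are only indicative.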


Source: https://stackoverflow.com/questions/52072315/pandas-compare-value-with-previous-row-with-filtration-condition
