Pandas compare value with previous row with filtration condition

Question

I have a DataFrame with information about employee salaries. It has more than 900,000 rows.

Sample:

+----+-------------+---------------+----------+
|    |   table_num | name          |   salary |
|----+-------------+---------------+----------|
|  0 |      001234 | John Johnson  |     1200 |
|  1 |      001234 | John Johnson  |     1000 |
|  2 |      001235 | John Johnson  |     1000 |
|  3 |      001235 | John Johnson  |     1200 |
|  4 |      001235 | John Johnson  |     1000 |
|  5 |      001235 | Steve Stevens |     1000 |
|  6 |      001236 | Steve Stevens |     1200 |
|  7 |      001236 | Steve Stevens |     1200 |
|  8 |      001236 | Steve Stevens |     1200 |
+----+-------------+---------------+----------+

dtypes:

table_num: string
name: string
salary: float

I need to add a column that indicates whether the salary increased or decreased relative to the previous row. I'm using the shift() function to compare values between rows.

The main problem is filtering and iterating over every unique employee in the whole dataset.

It takes about three and a half hours in my script.

How can I do this faster?

My script:

# Keep only the unique combinations of 'table_num' and 'name',
# since the same 'table_num' can appear with different names
# and the same name sometimes appears with a different 'table_num'.
names_df = df[['table_num', 'name']].drop_duplicates()

# Extract each (table_num, name) pair in turn
for i in range(len(names_df)):    ### Bottleneck of the whole script ###
    t, n = names_df.iloc[i]

    # Use shift() and a lambda to check for a difference between consecutive rows
    mask = (df['table_num'] == t) & (df['name'] == n)
    diff_sal = (df.loc[mask, 'salary'] - df.loc[mask, 'salary'].shift(1)).apply(
        lambda x: 1 if x > 0 else (-1 if x < 0 else 0))
    df.loc[diff_sal.index, 'inc'] = diff_sal.values

Sample input data:

df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'], 
                     'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'], 
                     'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})

Sample output:

+----+-------------+---------------+----------+-------+
|    |   table_num | name          |   salary |   inc |
|----+-------------+---------------+----------+-------|
|  0 |      001234 | John Johnson  |     1200 |     0 |
|  1 |      001234 | John Johnson  |     1000 |    -1 |
|  2 |      001235 | John Johnson  |     1000 |     0 |
|  3 |      001235 | John Johnson  |     1200 |     1 |
|  4 |      001235 | John Johnson  |     1000 |    -1 |
|  5 |      001235 | Steve Stevens |     1000 |     0 |
|  6 |      001236 | Steve Stevens |     1200 |     0 |
|  7 |      001236 | Steve Stevens |     1200 |     0 |
|  8 |      001236 | Steve Stevens |     1200 |     0 |
+----+-------------+---------------+----------+-------+

Answer 1:

Use groupby together with diff:

df['inc'] = df.groupby(['table_num', 'name'])['salary'].diff().fillna(0.0)
df.loc[df['inc'] > 0.0, 'inc'] = 1.0
df.loc[df['inc'] < 0.0, 'inc'] = -1.0
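
To see why the fillna step is there, it can help to look at the raw output of diff() before any sign mapping. This is a minimal sketch on the question's sample data; the names `keys`, `raw` and `first_rows` are my own and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

keys = ['table_num', 'name']

# diff() is computed inside each (table_num, name) group, so the first
# row of every group has no predecessor and comes back as NaN.
raw = df.groupby(keys)['salary'].diff()

# Index labels of the rows that start a new group.
first_rows = raw[raw.isna()].index.tolist()
print(first_rows)  # [0, 2, 5, 6]
```

Those NaN rows are the ones that fillna(0.0) turns into "no change", which matches the 0 values in the expected output.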

Answer 2:

Use DataFrameGroupBy.diff with numpy.sign, then cast the result to integers:

import numpy as np

df['new'] = np.sign(df.groupby(['table_num', 'name'])['salary'].diff().fillna(0)).astype(int)
print(df)
   table_num           name  salary  new
0     001234   John Johnson  1200.0    0
1     001234   John Johnson  1000.0   -1
2     001235   John Johnson  1000.0    0
3     001235   John Johnson  1200.0    1
4     001235   John Johnson  1000.0   -1
5     001235  Steve Stevens  1000.0    0
6     001236  Steve Stevens  1200.0    0
7     001236  Steve Stevens  1200.0    0
8     001236  Steve Stevens  1200.0    0
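
Before swapping a slow loop for a grouped one-liner on 900,000+ rows, it may be worth checking on a small frame that both give the same answer. The sketch below does that on synthetic data. The data sizes, the seed, and the function names `inc_loop` and `inc_vectorized` are assumptions made for illustration, and the loop is only written in the spirit of the question's script, not copied from it.

```python
import numpy as np
import pandas as pd

# Synthetic data; sizes and seed are arbitrary assumptions.
rng = np.random.default_rng(0)
n_rows = 2_000
df = pd.DataFrame({
    'table_num': rng.integers(0, 50, n_rows).astype(str),
    'name': rng.choice(['A', 'B', 'C', 'D'], n_rows),
    'salary': rng.choice([1000., 1100., 1200.], n_rows),
})
keys = ['table_num', 'name']

def inc_loop(frame):
    """Per-employee loop: one full boolean scan of the frame per pair."""
    out = pd.Series(0, index=frame.index, dtype=int)
    for t, n in frame[keys].drop_duplicates().itertuples(index=False):
        s = frame.loc[(frame['table_num'] == t) & (frame['name'] == n), 'salary']
        out.loc[s.index] = np.sign(s - s.shift(1)).fillna(0).astype(int)
    return out

def inc_vectorized(frame):
    """Single grouped pass using groupby + diff + sign."""
    return np.sign(frame.groupby(keys)['salary'].diff().fillna(0)).astype(int)

same = bool(np.array_equal(inc_loop(df).to_numpy(),
                           inc_vectorized(df).to_numpy()))
print(same)  # True
```

The loop rescans the whole frame once per (table_num, name) pair, so its cost grows with the number of pairs times the number of rows, while the grouped version handles all pairs in one pass. Wrapping both calls in `timeit` on your own data is the most reliable way to see the difference.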

Answer 3:

shift() is the right tool, but you should avoid loops as much as you can. Here we can leverage groupby() and transform(); see the pandas docs.

In your case, group by both key columns so that the same name under a different table_num is treated as a separate employee:

# assign() returns a new DataFrame; it does not modify df in place
df.assign(inc=lambda x: x.groupby(['table_num', 'name'])
                      .salary
                      .transform(lambda y: y - y.shift(1))
                      .apply(lambda d: 1 if d > 0 else (-1 if d < 0 else 0))
      )

yields:

    table_num   name       salary   inc
0   001234  John Johnson    1200.0  0
1   001234  John Johnson    1000.0  -1
2   001235  John Johnson    1000.0  0
3   001235  John Johnson    1200.0  1
4   001235  John Johnson    1000.0  -1
5   001235  Steve Stevens   1000.0  0
6   001236  Steve Stevens   1200.0  0
7   001236  Steve Stevens   1200.0  0
8   001236  Steve Stevens   1200.0  0
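
The same shift-based idea can probably be written without any Python-level lambdas, since a grouped column has its own shift() method. This is a sketch under that assumption, using the question's sample data; `prev` and `inc` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'table_num': ['001234', '001234', '001235', '001235', '001235',
                  '001235', '001236', '001236', '001236'],
    'name': ['John Johnson'] * 5 + ['Steve Stevens'] * 4,
    'salary': [1200., 1000., 1000., 1200., 1000., 1000., 1200., 1200., 1200.],
})

keys = ['table_num', 'name']

# Previous salary within the same (table_num, name) group;
# NaN for the first row of each group.
prev = df.groupby(keys)['salary'].shift(1)

# Sign of the change: 1 = raise, -1 = cut, 0 = unchanged or no previous row.
inc = np.sign(df['salary'] - prev).fillna(0).astype(int)
print(inc.tolist())  # [0, -1, 0, 1, -1, 0, 0, 0, 0]
```

Avoiding `transform(lambda ...)` and `apply(lambda ...)` keeps the work in vectorized pandas and NumPy routines, which tends to matter on a frame with hundreds of thousands of rows.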

Answer 4:

I think you can search for the term "pandas vectorization" to learn how to speed up operations on DataFrames. For your question, could you try the following:

import pandas as pd

df = pd.DataFrame({'table_num': ['001234','001234','001235','001235','001235','001235','001236','001236','001236'],
                     'name': ['John Johnson','John Johnson','John Johnson','John Johnson','John Johnson', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens', 'Steve Stevens'],
                     'salary':[1200.,1000.,1000.,1200.,1000.,1000.,1200.,1200.,1200.]})

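# Build a single grouping key from both identifying columns
df['temp'] = df['name'] + df['table_num']
# Use a stable sort so rows keep their original order within each key
df.sort_values('temp', kind='mergesort', inplace=True)
df['diff'] = df.groupby('temp')['salary'].diff()
# x / |x| gives the sign; NaN (first row of a group, or 0 / 0) becomes 0
df['diff'] = (df['diff'] / abs(df['diff'])).fillna(0)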
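
One caveat with building a single key by string concatenation is that two different (name, table_num) pairs can, at least in principle, collapse into the same string. The rows below are hypothetical and deliberately contrived to show this; they are not taken from the question's data.

```python
import pandas as pd

# Hypothetical rows chosen so that two different employees produce the
# same concatenated key: 'Ann1' + '23' == 'Ann' + '123' == 'Ann123'.
df = pd.DataFrame({
    'table_num': ['23', '123'],
    'name': ['Ann1', 'Ann'],
    'salary': [1000., 1200.],
})

# Number of distinct keys when the two columns are glued together.
n_concat = (df['name'] + df['table_num']).nunique()

# Number of groups when groupby receives the columns as a list.
n_pairs = df.groupby(['table_num', 'name']).ngroups

print(n_concat, n_pairs)  # 1 2
```

Real names rarely end in digits, so this may never bite in practice, but passing a list of columns to groupby sidesteps the question entirely and also removes the need for the temporary column.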

Source: https://stackoverflow.com/questions/52072315/pandas-compare-value-with-previous-row-with-filtration-condition

Tags

python

pandas

dataframe

compare

rows