How to keep track of previous date record column in pandas dataframe?

说谎 2021-01-25 11:35

This question follows on from this SO question.

I want to perform some data analysis on a pandas DataFrame. I have a dataframe like the one below:

                


        
1 answer
  • 2021-01-25 11:56

    You can merge the dataframe with itself after computing two keys for each row: the month number (from the date) and the month number of the previous month.

    Let's start by computing those two values. For convenience, I first converted the raw month string to datetime, which let me use relativedelta to compute the previous month. This keeps the result correct across a change of year (January maps back to December of the previous year).

    In [7]: df['month'] = pd.to_datetime(df['month'])
    
    In [8]: df['month_num'] = df['month'].apply(lambda x: x.strftime('%Y-%m'))
    
    In [9]: from dateutil.relativedelta import relativedelta
    
    In [10]: df['previous_month_num'] = df['month'].apply(lambda x: (x + relativedelta(months=-1)).strftime('%Y-%m'))
    
    In [11]: df
    Out[11]:
         city      month person_count person_name person_symbol sir  sport_name  \
    0  mumbai 2017-01-23           10      ramesh           ram   a    football
    1  mumbai 2017-01-23           14      ramesh           mum   a    football
    2   delhi 2017-01-23           25      ramesh           mum   a    football
    3   delhi 2017-01-23           20      ramesh           ram   a    football
    4  mumbai 2017-02-22           34      ramesh           ram   b    football
    5  mumbai 2017-02-22           23      ramesh           mum   b    football
    6   delhi 2017-02-22           43      ramesh           mum   b    football
    7   delhi 2017-02-22           34      ramesh           ram   b    football
    8    pune 2017-03-03           10      mahesh           mah   c  basketball
    9  nagpur 2017-03-03           20      mahesh           mah   c  basketball
    
      month_num previous_month_num
    0   2017-01            2016-12
    1   2017-01            2016-12
    2   2017-01            2016-12
    3   2017-01            2016-12
    4   2017-02            2017-01
    5   2017-02            2017-01
    6   2017-02            2017-01
    7   2017-02            2017-01
    8   2017-03            2017-02
    9   2017-03            2017-02
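
    The year-boundary behaviour can be checked on a tiny example. This is only a minimal sketch: the three sample dates are made up, and the `dt.to_period('M')` variant is an alternative shown for comparison, not something the answer itself used.

```python
import pandas as pd
from dateutil.relativedelta import relativedelta

# Made-up sample dates; the January row exercises the change of year.
sample = pd.DataFrame(
    {'month': pd.to_datetime(['2017-01-23', '2017-02-22', '2017-03-03'])})

# Same computation as in the session above: format the month, then
# shift back one month with relativedelta and format again.
sample['month_num'] = sample['month'].apply(lambda x: x.strftime('%Y-%m'))
sample['previous_month_num'] = sample['month'].apply(
    lambda x: (x + relativedelta(months=-1)).strftime('%Y-%m'))

# Alternative (an assumption, not part of the original answer):
# monthly periods support integer arithmetic directly.
periods = sample['month'].dt.to_period('M')
sample['previous_via_period'] = (periods - 1).astype(str)

print(sample)
```

    Both columns should agree, with the January row pointing back to 2016-12.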
    

    We can then merge the dataframe into itself, using the computed month values as merging keys:

    In [12]: relevant_columns = ['city', 'person_symbol', 'sport_name']
    
    In [13]: pd.merge(df, df,
        ...:          left_on=relevant_columns + ['previous_month_num'],
        ...:          right_on=relevant_columns + ['month_num'],
        ...:          how='left', suffixes=('', '_previous')
        ...:          )[list(df.columns) + ['person_count_previous']
        ...:          ].fillna(0).drop(['month_num', 'previous_month_num'], axis=1)
    Out[13]:
         city      month person_count person_name person_symbol sir  sport_name  \
    0  mumbai 2017-01-23           10      ramesh           ram   a    football
    1  mumbai 2017-01-23           14      ramesh           mum   a    football
    2   delhi 2017-01-23           25      ramesh           mum   a    football
    3   delhi 2017-01-23           20      ramesh           ram   a    football
    4  mumbai 2017-02-22           34      ramesh           ram   b    football
    5  mumbai 2017-02-22           23      ramesh           mum   b    football
    6   delhi 2017-02-22           43      ramesh           mum   b    football
    7   delhi 2017-02-22           34      ramesh           ram   b    football
    8    pune 2017-03-03           10      mahesh           mah   c  basketball
    9  nagpur 2017-03-03           20      mahesh           mah   c  basketball
    
      person_count_previous
    0                     0
    1                     0
    2                     0
    3                     0
    4                    10
    5                    14
    6                    25
    7                    20
    8                     0
    9                     0
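
    For readers who want to run the self-merge outside an IPython session, here is a self-contained sketch of the same idea. The four-row dataframe is made up for illustration, and trimming and renaming the right-hand side before merging is a design choice of this sketch (it avoids the column selection after the merge), not the exact call shown above.

```python
import pandas as pd

# Made-up sample: two months of counts for two (city, person_symbol) pairs.
df = pd.DataFrame({
    'city': ['mumbai', 'delhi', 'mumbai', 'delhi'],
    'person_symbol': ['ram', 'mum', 'ram', 'mum'],
    'sport_name': ['football'] * 4,
    'month': pd.to_datetime(
        ['2017-01-23', '2017-01-23', '2017-02-22', '2017-02-22']),
    'person_count': [10, 25, 34, 43],
})

keys = ['city', 'person_symbol', 'sport_name']

# Month key of each row and of the month before it.
period = df['month'].dt.to_period('M')
df['month_num'] = period.astype(str)
df['previous_month_num'] = (period - 1).astype(str)

# Right-hand side: keys, month key and the count to bring over. Renaming
# month_num to previous_month_num lets both sides merge on the same names.
previous = df[keys + ['month_num', 'person_count']].rename(columns={
    'month_num': 'previous_month_num',
    'person_count': 'person_count_previous',
})

# Left merge keeps every original row; unmatched rows get NaN.
result = df.merge(previous, on=keys + ['previous_month_num'], how='left')
result = result.drop(columns=['month_num', 'previous_month_num'])

print(result)
```

    The January rows have no December data to match, so their `person_count_previous` stays NaN until it is filled.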
    

    Some comments:

    • I used ['city', 'person_symbol', 'sport_name'] as the key columns for the merge, but you can add more, depending on what exactly you want to match on.
    • The new column is named person_count_previous, but you can rename it if another name suits you better.
    • By default, when there is no match for the previous month, the new column is NaN. I replaced those values with 0 using fillna.
    • I removed the "temporary" month_num and previous_month_num columns using drop, but feel free to keep them.
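
    The points about fillna and renaming can be sketched as follows. Note that calling fillna(0) on the whole dataframe also fills gaps in every other column, so this sketch restricts it to the new column. The small `result` frame and the name `last_month_count` are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical merge result: NaN where no previous-month row matched,
# plus an unrelated gap in 'sir' that should not be turned into 0.
result = pd.DataFrame({
    'city': ['mumbai', 'mumbai', 'pune'],
    'sir': ['a', 'b', None],
    'person_count': [10, 34, 10],
    'person_count_previous': [float('nan'), 10.0, float('nan')],
})

# Fill only the new column, then cast back to int (NaN had forced float).
result['person_count_previous'] = (
    result['person_count_previous'].fillna(0).astype(int))

# Rename if another name reads better; 'last_month_count' is just an example.
result = result.rename(
    columns={'person_count_previous': 'last_month_count'})

print(result)
```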