Computing age from to_timedelta is weird, and DateOffset is not scalable over a Series

一整个雨季 2020-12-21 07:21

I have two columns:

          date   age
0   2016-01-05  47.0
1   2016-01-05  43.0
2   2016-01-05  28.0
3   2016-01-05  46.0
4   2016-01-04  39.0


        
1 Answer
  • 2020-12-21 08:21

    If every row needs its own non-fixed offset (e.g. a different number of months or years), it can be faster to loop over the unique offset values than over the rows. A groupby does exactly that.

    The saving is largest when the number of unique offsets is much smaller than the number of rows in the DataFrame. That is very likely the case for realistic integer ages and a very long DataFrame.

    # The ages are stored as floats (47.0, ...), so cast each group key to int for DateOffset.
    pd.concat([gp.assign(dob = gp.date - pd.offsets.DateOffset(years=int(age)))
               for age, gp in df.groupby('age', sort=False)])
    
            date   age        dob
    0 2016-01-05  47.0 1969-01-05
    1 2016-01-05  43.0 1973-01-05
    2 2016-01-05  28.0 1988-01-05
    3 2016-01-05  46.0 1970-01-05
    4 2016-01-04  39.0 1977-01-04
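
For anyone who wants to try this on the question's sample data, here is a minimal self-contained sketch of the groupby approach above. The helper name `add_dob`, the `int(age)` cast, and the trailing `sort_index()` are my own additions for illustration, not part of the original snippet; the row-wise `apply` is included only as a reference to check against.

```python
import pandas as pd

# Sample data copied from the question; ages are floats, as in the original frame.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-05", "2016-01-05", "2016-01-05",
                            "2016-01-05", "2016-01-04"]),
    "age": [47.0, 43.0, 28.0, 46.0, 39.0],
})


def add_dob(frame):
    """Subtract `age` calendar years from `date`, one vectorized step per unique age.

    `add_dob` is a hypothetical helper name used only for this sketch.
    """
    parts = [
        # The group key is a float here, so cast it to a whole number of years.
        gp.assign(dob=gp["date"] - pd.DateOffset(years=int(age)))
        for age, gp in frame.groupby("age", sort=False)
    ]
    # concat returns rows in group order; sort_index restores the original row order.
    return pd.concat(parts).sort_index()


result = add_dob(df)

# Row-wise reference, used only to check the grouped result.
expected = df.apply(
    lambda row: row["date"] - pd.DateOffset(years=int(row["age"])), axis=1
)
print(result)
```

The `sort_index()` call matters once ages repeat: without it the rows come back grouped by age rather than in their original order.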
    

    Some timings:

    import perfplot
    import pandas as pd
    import numpy as np
    
    perfplot.show(
        setup=lambda n: pd.DataFrame({'date': np.random.choice(pd.date_range('1980-01-01', freq='50D', periods=100), n),
                                      'age': np.random.choice(range(100), n)}), 
        kernels=[
            lambda df: pd.concat([gp.assign(dob = gp.date - pd.offsets.DateOffset(years=idx)) 
                                  for idx, gp in df.groupby('age', sort=False)]),
            lambda df: df.apply(lambda x: x['date'] - pd.DateOffset(years=int(x['age'])), axis=1),
        ],
        labels=["groupby", "apply"],
        n_range=[2 ** k for k in range(15)],
        equality_check=None,  # Because datetime type
        xlabel="len(df)"
    )
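
If `perfplot` is not installed, a rough comparison can be sketched with the standard-library `timeit` instead. This is only an approximation of the benchmark above under my own assumptions: the frame size `n`, the seed, and the helper names `dob_groupby` and `dob_apply` are hypothetical, and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n = 2_000                       # hypothetical frame size for a quick check

dates = pd.date_range("1980-01-01", freq="50D", periods=100)
df = pd.DataFrame({
    # Pick n random dates by position so the column stays datetime-typed.
    "date": dates[rng.integers(0, len(dates), n)],
    "age": rng.integers(1, 100, n),
})


def dob_groupby(frame):
    # One vectorized subtraction per unique age, then restore row order.
    return pd.concat([
        gp["date"] - pd.DateOffset(years=int(age))
        for age, gp in frame.groupby("age", sort=False)
    ]).sort_index()


def dob_apply(frame):
    # One Python-level subtraction per row.
    return frame.apply(
        lambda row: row["date"] - pd.DateOffset(years=int(row["age"])), axis=1
    )


# Both approaches should produce the same dates before timing them.
same = bool((dob_groupby(df).values == dob_apply(df).values).all())

t_groupby = timeit.timeit(lambda: dob_groupby(df), number=3)
t_apply = timeit.timeit(lambda: dob_apply(df), number=3)
print(f"same result: {same}")
print(f"groupby: {t_groupby:.3f}s  apply: {t_apply:.3f}s")
```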
    
