Computing age from to_timedelta is weird, and DateOffset is not scalable over a Series

I have two columns:

          date   age
0   2016-01-05  47.0
1   2016-01-05  43.0
2   2016-01-05  28.0
3   2016-01-05  46.0
4   2016-01-04  39.0

1 answer

南笙

2020-12-21 08:21

If every row needs a different calendar-based offset (e.g. months or years, which have no fixed length), it can be faster to loop over the unique offsets instead of over the rows. A groupby does this.

The saving is largest when the number of unique offsets is much smaller than the number of rows in the DataFrame, which is very likely for integer ages in a very long DataFrame.

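# Build one DateOffset per unique age and apply it to that whole group at once.
# int(age) because the age column is float and DateOffset needs whole years.
pd.concat([gp.assign(dob=gp.date - pd.offsets.DateOffset(years=int(age)))
           for age, gp in df.groupby('age', sort=False)])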

        date   age        dob
0 2016-01-05  47.0 1969-01-05
1 2016-01-05  43.0 1973-01-05
2 2016-01-05  28.0 1988-01-05
3 2016-01-05  46.0 1970-01-05
4 2016-01-04  39.0 1977-01-04
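
For anyone who wants to run the groupby approach end to end, here is a minimal self-contained sketch. The `add_dob` helper name and the sixth sample row (a repeated age of 47.0) are made up for illustration; they are not part of the original answer. The repeated age is there to show that `pd.concat` stacks the groups one after another, so a `sort_index()` is needed if the original row order matters.

```python
import pandas as pd

# Sample data mirroring the question's two columns.
# The last row is an illustrative addition with a repeated age.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-05", "2016-01-05", "2016-01-05",
                            "2016-01-05", "2016-01-04", "2016-01-04"]),
    "age": [47.0, 43.0, 28.0, 46.0, 39.0, 47.0],
})


def add_dob(frame):
    """Subtract `age` years from `date`, one vectorized step per unique age."""
    parts = [
        gp.assign(dob=gp["date"] - pd.DateOffset(years=int(age)))
        for age, gp in frame.groupby("age", sort=False)
    ]
    # concat returns the rows grouped by age, so restore the original order.
    return pd.concat(parts).sort_index()


result = add_dob(df)
print(result)
```

One thing to keep in mind: `groupby` drops rows whose key is NaN by default, so rows with a missing age would not appear in the result unless they are handled separately.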

Some timings:

import perfplot
import pandas as pd
import numpy as np

perfplot.show(
    # Random dates and integer ages; at most 100 unique ages, so few groups.
    setup=lambda n: pd.DataFrame({
        'date': np.random.choice(pd.date_range('1980-01-01', freq='50D', periods=100), n),
        'age': np.random.choice(range(100), n),
    }),
    kernels=[
        # One vectorized subtraction per unique age.
        lambda df: pd.concat([gp.assign(dob=gp.date - pd.offsets.DateOffset(years=idx))
                              for idx, gp in df.groupby('age', sort=False)]),
        # One DateOffset per row.
        lambda df: df.apply(lambda x: x['date'] - pd.DateOffset(years=int(x['age'])), axis=1),
    ],
    labels=["groupby", "apply"],
    n_range=[2 ** k for k in range(15)],
    equality_check=None,  # Because datetime type
    xlabel="len(df)"
)
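
If perfplot is not installed, a rough comparison can be sketched with the standard library's `timeit` instead. This is only an approximation of the benchmark above, not a replacement for it: the frame size `n`, the seed, the repeat count and the function names `by_groupby` and `by_apply` are all arbitrary choices made for this sketch, and ages are drawn from 1 to 99.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2_000  # arbitrary size chosen for this sketch
dates = pd.date_range("1980-01-01", freq="50D", periods=100)
df = pd.DataFrame({
    "date": dates[rng.integers(0, len(dates), n)],
    "age": rng.integers(1, 100, n),
})


def by_groupby(frame):
    # One vectorized subtraction per unique age.
    return pd.concat([
        gp.assign(dob=gp["date"] - pd.DateOffset(years=int(age)))
        for age, gp in frame.groupby("age", sort=False)
    ])


def by_apply(frame):
    # One DateOffset built and applied per row.
    return frame.apply(
        lambda x: x["date"] - pd.DateOffset(years=int(x["age"])), axis=1
    )


t_groupby = timeit.timeit(lambda: by_groupby(df), number=3)
t_apply = timeit.timeit(lambda: by_apply(df), number=3)
print(f"groupby: {t_groupby:.3f}s  apply: {t_apply:.3f}s")
```

The absolute numbers depend on the machine and the pandas version, so treat them as a relative comparison only.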
