I have two columns:
        date   age
0 2016-01-05  47.0
1 2016-01-05  43.0
2 2016-01-05  28.0
3 2016-01-05  46.0
4 2016-01-04  39.0
If every row needs its own non-standard offset (such as months or years), it can save time to loop over the unique offsets instead of over the rows. You can do this with a groupby.
The saving is largest when the number of unique offsets is much smaller than the number of rows in your DataFrame. That is very likely the case with realistic integer ages and a very long DataFrame.
pd.concat([gp.assign(dob = gp.date - pd.offsets.DateOffset(years=age))
           for age, gp in df.groupby('age', sort=False)])
        date   age        dob
0 2016-01-05  47.0 1969-01-05
1 2016-01-05  43.0 1973-01-05
2 2016-01-05  28.0 1988-01-05
3 2016-01-05  46.0 1970-01-05
4 2016-01-04  39.0 1977-01-04
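One thing to watch with the groupby approach above: pd.concat returns the rows grouped by offset, so as soon as an age repeats the result is no longer in the original row order. Below is a minimal sketch that restores the order with sort_index and checks the result against a row-by-row computation. The helper name subtract_years_grouped, its parameters, and the extra sixth row are made up for illustration, and the int() cast assumes the ages are whole numbers stored as floats.

```python
import pandas as pd

# Sample frame mirroring the example above, plus one repeated age (row 5)
# so that the grouping actually reorders rows.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-05", "2016-01-05", "2016-01-05",
                            "2016-01-05", "2016-01-04", "2016-01-04"]),
    "age": [47.0, 43.0, 28.0, 46.0, 39.0, 47.0],
})


def subtract_years_grouped(frame, date_col="date", years_col="age", out_col="dob"):
    """Hypothetical helper: subtract a per-row number of years, one group at a time."""
    parts = [
        # One vectorized subtraction per unique offset value.
        gp.assign(**{out_col: gp[date_col] - pd.DateOffset(years=int(years))})
        for years, gp in frame.groupby(years_col, sort=False)
    ]
    # concat yields rows grouped by offset; sort_index restores the input order.
    return pd.concat(parts).sort_index()


result = subtract_years_grouped(df)

# Row-by-row reference computation, used only to check the grouped result.
expected = df.apply(lambda r: r["date"] - pd.DateOffset(years=int(r["age"])), axis=1)
print((result["dob"] == expected).all())
```

Note that groupby drops rows whose key is NaN by default, so rows with a missing age would not appear in the result unless you handle them separately.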
Some timings:
import perfplot
import pandas as pd
import numpy as np
perfplot.show(
    setup=lambda n: pd.DataFrame({'date': np.random.choice(pd.date_range('1980-01-01', freq='50D', periods=100), n),
                                  'age': np.random.choice(range(100), n)}),
    kernels=[
        lambda df: pd.concat([gp.assign(dob = gp.date - pd.offsets.DateOffset(years=idx))
                              for idx, gp in df.groupby('age', sort=False)]),
        lambda df: df.apply(lambda x: x['date'] - pd.DateOffset(years=int(x['age'])), axis=1),
    ],
    labels=["groupby", "apply"],
    n_range=[2 ** k for k in range(15)],
    equality_check=None,  # Because datetime type
    xlabel="len(df)"
)
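If perfplot is not available, a rough spot check with the standard-library timeit module is possible. This is only a sketch of the same comparison at one size: the seed, the frame length n, the repeat count, and the function names by_group and by_row are arbitrary choices for illustration, and the absolute numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily for repeatability
n = 2_000                       # arbitrary frame length for a quick check

dates = pd.date_range("1980-01-01", freq="50D", periods=100)
df = pd.DataFrame({
    "date": dates[rng.integers(0, len(dates), n)],
    "age": rng.integers(0, 100, n),
})


def by_group(frame):
    # One vectorized subtraction per unique age.
    return pd.concat(
        [gp.assign(dob=gp["date"] - pd.DateOffset(years=int(age)))
         for age, gp in frame.groupby("age", sort=False)]
    )


def by_row(frame):
    # One DateOffset construction and subtraction per row.
    return frame.apply(lambda r: r["date"] - pd.DateOffset(years=int(r["age"])), axis=1)


t_group = timeit.timeit(lambda: by_group(df), number=3)
t_row = timeit.timeit(lambda: by_row(df), number=3)
print(f"groupby: {t_group:.3f}s  apply: {t_row:.3f}s")
```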