I have the following dataframe:
user_id purchase_date
1 2015-01-23 14:05:21
2 2015-02-05 05:07:30
3 2015-02-18 17:08:51
4 2015-03-21 17:07:30
5 2015-03-11 18:32:56
6 2015-03-03 11:02:30
and purchase_date is a datetime64[ns] column. I need to add a new column, df['month'], that contains the first day of the month of the purchase date:
df['month']
2015-01-01
2015-02-01
2015-02-01
2015-03-01
2015-03-01
2015-03-01
I'm looking for something like DATE_FORMAT(purchase_date, "%Y-%m-01") m
in SQL. I have tried the following code:
df['month']=df['purchase_date'].apply(lambda x : x.replace(day=1))
It almost works, but it keeps the time of day and returns: 2015-01-01 14:05:21.
The simplest and fastest way is to convert to a NumPy array with .values and then cast:
df['month'] = df['purchase_date'].values.astype('datetime64[M]')
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
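For anyone who wants to try the cast in isolation, here is a minimal self-contained sketch. The sample frame is made up for illustration, and the extra cast back to datetime64[ns] is my own assumption: it is there so the column dtype does not depend on how a given pandas version stores month-resolution values.

```python
import pandas as pd

# Hypothetical sample data, including a purchase exactly on the 1st
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "purchase_date": pd.to_datetime([
        "2015-01-23 14:05:21",
        "2015-02-05 05:07:30",
        "2015-03-01 00:00:00",
    ]),
})

# Truncate to month resolution in NumPy, then cast back to nanoseconds
# (assumption: the explicit ns cast keeps the dtype stable across versions)
months = df["purchase_date"].values.astype("datetime64[M]")
df["month"] = months.astype("datetime64[ns]")

print(df)
```

Because this is a plain truncation, a date that already falls on the first of the month maps to itself.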
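Another solution uses floor and pd.offsets.MonthBegin(1). Note that both variants move a date that is already the first of a month back to the previous month: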
df['month'] = df['purchase_date'].dt.floor('d') - pd.offsets.MonthBegin(1)
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
df['month'] = (df['purchase_date'] - pd.offsets.MonthBegin(1)).dt.floor('d')
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
The last solution is to create a month period with to_period:
df['month'] = df['purchase_date'].dt.to_period('M')
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01
1 2 2015-02-05 05:07:30 2015-02
2 3 2015-02-18 17:08:51 2015-02
3 4 2015-03-21 17:07:30 2015-03
4 5 2015-03-11 18:32:56 2015-03
5 6 2015-03-03 11:02:30 2015-03
... and then convert back to datetimes with to_timestamp, but it is a bit slower:
df['month'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
print (df)
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
There are many solutions, so here are some timings:
rng = pd.date_range('1980-04-03 15:41:12', periods=100000, freq='20H')
df = pd.DataFrame({'purchase_date': rng})
print (df.head())
In [300]: %timeit df['month1'] = df['purchase_date'].values.astype('datetime64[M]')
100 loops, best of 3: 9.2 ms per loop
In [301]: %timeit df['month2'] = df['purchase_date'].dt.floor('d') - pd.offsets.MonthBegin(1)
100 loops, best of 3: 15.9 ms per loop
In [302]: %timeit df['month3'] = (df['purchase_date'] - pd.offsets.MonthBegin(1)).dt.floor('d')
100 loops, best of 3: 12.8 ms per loop
In [303]: %timeit df['month4'] = df['purchase_date'].dt.to_period('M').dt.to_timestamp()
1 loop, best of 3: 399 ms per loop
#MaxU solution
In [304]: %timeit df['month5'] = df['purchase_date'].dt.normalize() - pd.offsets.MonthBegin(1)
10 loops, best of 3: 24.9 ms per loop
#MaxU solution 2
In [305]: %timeit df['month'] = df['purchase_date'] - pd.offsets.MonthBegin(1, normalize=True)
10 loops, best of 3: 28.9 ms per loop
#Wen solution
In [306]: %timeit df['month6']= pd.to_datetime(df.purchase_date.astype(str).str[0:7]+'-01')
1 loop, best of 3: 214 ms per loop
We can use a date offset in conjunction with Series.dt.normalize:
In [60]: df['month'] = df['purchase_date'].dt.normalize() - pd.offsets.MonthBegin(1)
In [61]: df
Out[61]:
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Or a much nicer solution from @BradSolomon:
In [95]: df['month'] = df['purchase_date'] - pd.offsets.MonthBegin(1, normalize=True)
In [96]: df
Out[96]:
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
Try this ..
df['month']=pd.to_datetime(df.purchase_date.astype(str).str[0:7]+'-01')
Out[187]:
user_id purchase_date month
0 1 2015-01-23 14:05:21 2015-01-01
1 2 2015-02-05 05:07:30 2015-02-01
2 3 2015-02-18 17:08:51 2015-02-01
3 4 2015-03-21 17:07:30 2015-03-01
4 5 2015-03-11 18:32:56 2015-03-01
5 6 2015-03-03 11:02:30 2015-03-01
For me, df['purchase_date'] - pd.offsets.MonthBegin(1) didn't work (it fails for the first day of the month), so I'm subtracting the days of the month like this:
df['purchase_date'] - pd.to_timedelta(df['purchase_date'].dt.day - 1, unit='d')
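To make the edge case concrete, here is a small sketch with made-up dates comparing the offset subtraction with the day-subtraction approach. The trailing .dt.normalize() is my addition (not part of the line above) so that the time of day is dropped as well.

```python
import pandas as pd

# Hypothetical dates: two on the 1st of March, one later in the month
s = pd.Series(pd.to_datetime([
    "2015-03-01 00:00:00",
    "2015-03-01 09:30:00",
    "2015-03-21 17:07:30",
]))

# Offset approach: dates on the 1st roll back to the previous month
shifted = s.dt.normalize() - pd.offsets.MonthBegin(1)

# Day-subtraction approach: remove (day - 1) days, then drop the time
fixed = (s - pd.to_timedelta(s.dt.day - 1, unit="D")).dt.normalize()

print(pd.DataFrame({"date": s, "shifted": shifted, "fixed": fixed}))
```

In this sketch the first two rows of shifted land in February, while fixed stays in March for every row.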
@Eyal: This is what I did to get the first day of the month using pd.offsets.MonthBegin and handle the scenario where the day is already the first day of the month.
import pandas as pd
from_date= pd.to_datetime('2018-12-01')
from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not from_date.is_month_start else from_date
from_date
result: Timestamp('2018-12-01 00:00:00')
from_date= pd.to_datetime('2018-12-05')
from_date = from_date - pd.offsets.MonthBegin(1, normalize=True) if not from_date.is_month_start else from_date
from_date
result: Timestamp('2018-12-01 00:00:00')
Most of the proposed solutions don't work for the first day of the month.
The following solution works for any day of the month:
df['month'] = df['purchase_date'] + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)
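A quick way to convince yourself is to run the expression on a few boundary dates. The dates below are invented for the check; they cover the first of the month (with and without a time) and the last second of the month.

```python
import pandas as pd

# Hypothetical boundary cases within March 2015
s = pd.Series(pd.to_datetime([
    "2015-03-01 00:00:00",
    "2015-03-01 09:30:00",
    "2015-03-21 17:07:30",
    "2015-03-31 23:59:59",
]))

# Roll forward to the month end (no-op if already there), then step back
# to the month start and normalize to midnight
month = s + pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(normalize=True)

print(month)
```

The idea is that MonthEnd(0) first moves every date to a point that is never on a month start, so the following MonthBegin subtraction always lands in the same month.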
To extract the first day of every month, you could write a little helper function that will also work if the provided date is already the first of month. The function looks like this:
def first_of_month(date):
    return date + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)
You can apply this function to a pd.Series:
df['month'] = df['purchase_date'].apply(first_of_month)
With that you will get the month column as a Timestamp. If you need a specific format, you can convert it with the strftime() method.
df['month_str'] = df['month'].dt.strftime('%Y-%m-%d')
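One thing to be aware of: the helper as written keeps the time of day of the original value, which the '%Y-%m-%d' format simply hides. If you want midnight timestamps in the month column itself, a possible variant (my own tweak, using Timestamp.normalize()) looks like this, with made-up sample dates:

```python
import pandas as pd

def first_of_month(date):
    # Step back to the previous month end, forward one day,
    # then reset the time to midnight (the normalize() call is an addition)
    return (date + pd.offsets.MonthEnd(-1) + pd.offsets.Day(1)).normalize()

# Hypothetical sample dates, including the 1st and the last day of a month
s = pd.Series(pd.to_datetime([
    "2015-01-23 14:05:21",
    "2015-03-01 08:00:00",
    "2015-03-31 10:00:00",
]))

result = s.apply(first_of_month)
print(result)
```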
Source: https://stackoverflow.com/questions/45304531/extracting-the-first-day-of-month-of-a-datetime-type-column-in-pandas