I have a pandas dataframe where each observation has a date (stored in a column of dtype datetime64[ns]). These dates are spread over a period of about 5 years. I would like to plot a kernel-density plot of the dates of all the observations, with the years labelled on the x-axis.
I have figured out how to create a time-delta relative to some reference date and then create a density plot of the number of hours/days/years between each observation and the reference date:
df['relativeDate'].astype('timedelta64[D]').plot(kind='kde')
But this isn't exactly what I want: If I convert to year-deltas, then the x-axis is right but I lose the within-year variation. But if I take a smaller unit of time like hour or day, the x-axis labels are much harder to interpret.
What's the simplest way to make this work in Pandas?
Inspired by @JohnE's answer, an alternative way to convert the dates to numeric values is to use .toordinal().
import datetime
import pandas as pd
import numpy as np
# simulate some artificial data
# ===============================
np.random.seed(0)
dates = pd.date_range('2010-01-01', periods=31, freq='D')
df = pd.DataFrame(np.random.choice(dates,100), columns=['dates'])
# use toordinal() to get datenum
df['ordinal'] = [x.toordinal() for x in df.dates]
print(df)
dates ordinal
0 2010-01-13 733785
1 2010-01-16 733788
2 2010-01-22 733794
3 2010-01-01 733773
4 2010-01-04 733776
5 2010-01-28 733800
6 2010-01-04 733776
7 2010-01-08 733780
8 2010-01-10 733782
9 2010-01-20 733792
.. ... ...
90 2010-01-19 733791
91 2010-01-28 733800
92 2010-01-01 733773
93 2010-01-15 733787
94 2010-01-04 733776
95 2010-01-22 733794
96 2010-01-13 733785
97 2010-01-26 733798
98 2010-01-11 733783
99 2010-01-21 733793
[100 rows x 2 columns]
# plot non-parametric kde on numeric datenum
ax = df['ordinal'].plot(kind='kde')
# rename the xticks with labels
x_ticks = ax.get_xticks()
ax.set_xticks(x_ticks[::2])
xlabels = [datetime.datetime.fromordinal(int(x)).strftime('%Y-%m-%d') for x in x_ticks[::2]]
ax.set_xticklabels(xlabels)
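Since the question asks for years on the x-axis, the same tick-relabelling idea can be pointed at 1 January of each year instead of every second default tick. The following is only a sketch: `year_ticks` is a hypothetical helper name, not part of pandas or matplotlib, and it assumes the column holds datetime64 values as in the example above.

```python
import datetime
import pandas as pd

def year_ticks(dates):
    """Return (ordinals, labels) for 1 January of each year covering `dates`.

    `dates` is assumed to be a pandas Series of datetime64 values.
    """
    first_year = dates.min().year
    last_year = dates.max().year + 1  # boundary just after the last observation
    years = range(first_year, last_year + 1)
    ordinals = [datetime.date(y, 1, 1).toordinal() for y in years]
    labels = [str(y) for y in years]
    return ordinals, labels

# Hypothetical usage with the df and ax from the answer above:
# ticks, labels = year_ticks(df['dates'])
# ax.set_xticks(ticks)
# ax.set_xticklabels(labels)
```

The ticks are in the same ordinal units as the plotted column, so they line up with the density curve. The tails of the KDE may extend slightly past the first and last tick.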
I imagine there is some better and automatic way to do this, but if not then this ought to be a decent workaround. First, let's set up some sample data:
import numpy as np
import pandas as pd
np.random.seed(479)
start_date = '2011-1-1'
df = pd.DataFrame({ 'date':np.random.choice(
pd.date_range(start_date, periods=365*5, freq='D'), 50) })
df['rel'] = df['date'] - pd.to_datetime(start_date)
# whole days since start_date; recent pandas versions reject astype('timedelta64[D]')
df['rel'] = df['rel'].dt.days
date rel
0 2014-06-06 1252
1 2011-10-26 298
2 2013-08-24 966
3 2014-09-25 1363
4 2011-12-23 356
As you can see, 'rel' is just the number of days since the starting day. It's essentially an integer, so all you really need to do is normalize it with respect to the starting date.
df['year_as_float'] = pd.to_datetime(start_date).year + df.rel / 365.
date rel year_as_float
0 2014-06-06 1252 2014.430137
1 2011-10-26 298 2011.816438
2 2013-08-24 966 2013.646575
3 2014-09-25 1363 2014.734247
4 2011-12-23 356 2011.975342
You'd need to adjust that slightly for a date not starting on Jan 1. That's also ignoring any leap years which really isn't a practical issue if you're just producing a KDE plot over 5 years, but it could matter depending on what else you might want to do.
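If the leap-year and start-date caveats do matter, one possible variant computes the fractional year directly from each date rather than from the day offset. This is a sketch under assumptions: `fractional_year` is a hypothetical helper name, and the input is assumed to be a datetime64 Series such as df['date'].

```python
import numpy as np
import pandas as pd

def fractional_year(dates: pd.Series) -> pd.Series:
    """Convert a datetime64 Series to floats: year + fraction of that year elapsed."""
    # Use 366 days for leap years and 365 otherwise.
    days_in_year = np.where(dates.dt.is_leap_year, 366, 365)
    # dayofyear is 1-based, so 1 January maps to a fraction of exactly 0.
    return dates.dt.year + (dates.dt.dayofyear - 1) / days_in_year

sample = pd.Series(pd.to_datetime(['2011-01-01', '2012-07-02', '2014-06-06']))
print(fractional_year(sample))
```

Because each value is derived from its own date, this does not depend on the reference date falling on 1 January. The result can be plotted the same way, e.g. fractional_year(df['date']).plot(kind='kde').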
Here's the plot:
df['year_as_float'].plot(kind='kde')
Source: https://stackoverflow.com/questions/31348737/how-to-plot-kernel-density-plot-of-dates-in-pandas