I have a pandas dataframe with a datetime column. I would like to plot the distribution of the rows according to that date column, but I'm currently getting an unhelpful error.
I came across this question while having the same problem myself. As mentioned in the comments, it seems that seaborn's distplot doesn't support dates. Unfortunately, I could not find anything in the official documentation to support this claim.
I found two ways to deal with this problem. Neither is perfect, but they are the best I found.
Option 1: Convert dates to numbers
Convert the dates to some numeric metric and work with that. distplot works with numbers, so if each date is represented by a number we will be fine. The mapping between dates and numbers works much like a MinMax scaler. For example, we can set "2017-01-01" as 0 and "2020-06-06" as 1, and map all dates between them to values in the range [0,1].
Which numeric range to use depends on the range of your data: it could be days, months, years, etc.
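The min-max idea described above could be sketched roughly as follows. The sample dates and the variable names (dates, span, scaled) are illustrative assumptions, not part of the original example.

```python
import pandas as pd

# Illustrative data only; these three dates are assumptions for the sketch.
dates = pd.Series(pd.to_datetime(["2017-01-01", "2018-09-19", "2020-06-06"]))

# Min-max scaling: the earliest date maps to 0.0 and the latest to 1.0.
span = dates.max() - dates.min()        # a Timedelta covering the whole range
scaled = (dates - dates.min()) / span   # Timedelta / Timedelta gives plain floats

print(scaled)
```

Dividing one timedelta by another yields an ordinary float, so the scaled column can be passed to any function that expects numbers.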
I'll demonstrate this approach with this toy example.
import pandas as pd
import datetime as dt
original_dates = ["2016-03-05", "2016-03-05", "2016-02-05", "2016-02-05", "2016-02-05", "2014-03-05"]
dates_list = [dt.datetime.strptime(date, '%Y-%m-%d').date() for date in original_dates]
df = pd.DataFrame({"Date":dates_list})
The dataframe now looks like this:
Date
0 2016-03-05
1 2016-03-05
2 2016-02-05
3 2016-02-05
4 2016-02-05
5 2014-03-05
(This is not the best way to get dates into a dataframe, of course, but how they get there doesn't matter here.)
Now I create a new column that holds the difference in days between each date and the minimum date:
df["NewDate"] = df["Date"] - dt.date(2014,3,5)
df["NewDate"] = df["NewDate"].apply(lambda x: x.days)
result:
Date NewDate
0 2016-03-05 731
1 2016-03-05 731
2 2016-02-05 702
3 2016-02-05 702
4 2016-02-05 702
5 2014-03-05 0
Notice that I hard-coded the minimum date. You can compute the minimum from the data rather than hard-coding it; I just wanted to get through this part as quickly as possible.
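One way to avoid the hard-coded date, as a minimal sketch: this variant assumes the column is built with pd.to_datetime (so it is a datetime64 column rather than a column of date objects), which lets the .dt accessor do the conversion to days.

```python
import pandas as pd

original_dates = ["2016-03-05", "2016-03-05", "2016-02-05",
                  "2016-02-05", "2016-02-05", "2014-03-05"]
# Assumption: parse with pd.to_datetime so the column has dtype datetime64.
df = pd.DataFrame({"Date": pd.to_datetime(original_dates)})

# Derive the reference date from the data instead of hard-coding it.
min_date = df["Date"].min()

# Subtracting gives timedeltas; .dt.days extracts the whole-day difference.
df["NewDate"] = (df["Date"] - min_date).dt.days

print(df)
```

This should give the same NewDate values as the hard-coded version, since the minimum of the toy data is 2014-03-05.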
Now we can use distplot on our new column:
import seaborn as sns
sns.set()
ax = sns.distplot(df['NewDate'])
output:
As you can see, it shows days instead of dates. For my own problem it was fine to show it that way. If you want to show dates, an extra step is needed: show x-ticks that are a function of the x-axis values rather than of the data itself. See "Show xticks which are function of x-axis, not directly the data itself" and "Example with dates (pandas, matplotlib)".
As I said earlier, I scaled by difference in days, but you can do the same with months or years, depending on the data.
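The extra step for date labels might look something like the sketch below. It only computes tick positions and their labels; the names n_ticks, tick_positions and tick_labels, and the choice of four ticks, are assumptions made for illustration.

```python
import pandas as pd

min_date = pd.Timestamp("2014-03-05")   # reference date behind the day offsets
max_offset = 731                        # largest value in the NewDate column

# Pick a few evenly spaced positions on the numeric (days) axis.
n_ticks = 4  # arbitrary choice for this sketch
tick_positions = [round(i * max_offset / (n_ticks - 1)) for i in range(n_ticks)]

# Convert each day offset back to the calendar date it stands for.
tick_labels = [(min_date + pd.Timedelta(days=d)).strftime("%Y-%m-%d")
               for d in tick_positions]

print(list(zip(tick_positions, tick_labels)))
```

With the axes object returned by the plotting call, ax.set_xticks(tick_positions) followed by ax.set_xticklabels(tick_labels, rotation=45) should then replace the day counts with dates.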
Option 2: Use a histogram directly, without seaborn's distplot
In this question, "Can Pandas plot a histogram of dates?", there is an answer showing how to plot a histogram of dates using pandas's groupby.
It's not the same as distplot, but it can be a close-enough solution (as distplot is ultimately based on matplotlib's hist).
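A rough sketch of the groupby idea, reusing the toy dates from Option 1. Binning by calendar month is an assumption made here for illustration, not necessarily what the linked answer does.

```python
import pandas as pd

original_dates = ["2016-03-05", "2016-03-05", "2016-02-05",
                  "2016-02-05", "2016-02-05", "2014-03-05"]
df = pd.DataFrame({"Date": pd.to_datetime(original_dates)})

# Count rows per calendar month (the monthly bin is an assumption;
# use "D", "Y", etc. depending on the range of the data).
counts = df.groupby(df["Date"].dt.to_period("M")).size()

print(counts)
```

Calling counts.plot(kind="bar") would then draw the histogram. Note that months with no rows do not appear in counts at all, so gaps in time are not drawn to scale.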
You could convert the dates to Categorical type, and plot the resulting codes (which are integers). Then, label the x-ticks with the Date (as category).
import pandas as pd
import seaborn as sns
original_dates = [
"2016-03-05", "2016-03-05", "2016-02-05",
"2016-02-05", "2016-02-05", "2014-03-05"]
dates_list = pd.to_datetime(original_dates)
df = pd.DataFrame({"Date": dates_list})
df['date-as-cat'] = df['Date'].astype('category') # new
df['codes'] = df['date-as-cat'].cat.codes # new
print(df)
print(df.dtypes)
Date date-as-cat codes
0 2016-03-05 2016-03-05 2
1 2016-03-05 2016-03-05 2
2 2016-02-05 2016-02-05 1
3 2016-02-05 2016-02-05 1
4 2016-02-05 2016-02-05 1
5 2014-03-05 2014-03-05 0
Date datetime64[ns]
date-as-cat category
codes int8
dtype: object
The date-as-code and date-as-category info is obtained like this:
x = df[['codes', 'date-as-cat']].drop_duplicates().sort_values('codes')
print(x)
codes date-as-cat
5 0 2014-03-05
2 1 2016-02-05
0 2 2016-03-05
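To go from this table to labelled x-ticks, something along these lines may work. It is a sketch that only prepares the tick positions, labels and bar heights; tick_positions, tick_labels and counts are names introduced here for illustration.

```python
import pandas as pd

original_dates = ["2016-03-05", "2016-03-05", "2016-02-05",
                  "2016-02-05", "2016-02-05", "2014-03-05"]
df = pd.DataFrame({"Date": pd.to_datetime(original_dates)})
cat = df["Date"].astype("category")
df["codes"] = cat.cat.codes

# Code i corresponds to categories[i], so positions and labels line up by index.
tick_positions = list(range(len(cat.cat.categories)))
tick_labels = [d.strftime("%Y-%m-%d") for d in cat.cat.categories]

# Rows per code, i.e. the bar heights of a histogram over the codes.
counts = df["codes"].value_counts().sort_index()

print(tick_positions, tick_labels, counts.tolist())
```

After plotting df['codes'], passing these to ax.set_xticks(tick_positions) and ax.set_xticklabels(tick_labels) should label each bar with its date. Keep in mind that the codes are evenly spaced integers, so the distance between bars does not reflect the actual time between dates.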