I have a Series object that has:
date price
dec 12
may 15
apr 13
..
Problem statement:
use Sort_Dataframeby_Month function to sort month names in chronological order
Packages need to install.
$ pip install sorted-months-weekdays
$ pip install sort-dataframeby-monthorweek
example:
from sorted_months_weekdays import *
from sort_dataframeby_monthorweek import *
df = pd.DataFrame([['Jan',23],['Jan',16],['Dec',35],['Apr',79],['Mar',53],['Mar',12],['Feb',3]], columns=['Month','Sum'])
df
Out[11]:
Month Sum
0 Jan 23
1 Jan 16
2 Dec 35
3 Apr 79
4 Mar 53
5 Mar 12
6 Feb 3
To sort dataframe by Month use below function
Sort_Dataframeby_Month(df=df,monthcolumnname='Month')
Out[14]:
Month Sum
0 Jan 23
1 Jan 16
2 Feb 3
3 Mar 53
4 Mar 12
5 Apr 79
6 Dec 35
Thanks @Brad Solomon for offering a faster way to capitalize string!
Note 1 @Brad Solomon's answer using pd.categorical should save your resources more than my answer. He showed how to assign order to your categorical data. You should not miss it :P
Alternatively, you can use.
df = pd.DataFrame([["dec", 12], ["jan", 40], ["mar", 11], ["aug", 21],
["aug", 11], ["jan", 11], ["jan", 1]],
columns=["Month", "Price"])
# Preprocessing: capitalize `jan`, `dec` to `Jan` and `Dec`
df["Month"] = df["Month"].str.capitalize()
# Now the dataset should look like
# Month Price
# -----------
# Dec XX
# Jan XX
# Apr XX
# make it a datetime so that we can sort it:
# use %b because the data use the abbriviation of month
df["Month"] = pd.to_datetime(df.Month, format='%b', errors='coerce').dt.month
df = df.sort_values(by="Month")
total = (df.groupby(df['Month"])['Price'].mean())
# total
Month
1 17.333333
3 11.000000
8 16.000000
12 12.000000
Note 2
groupby
by default will sort group keys for you. Be aware to use the same key to sort and groupby in the df = df.sort_values(by=SAME_KEY)
and total = (df.groupby(df[SAME_KEY])['Price'].mean()).
Otherwise, one may gets unintended behavior. See Groupby preserve order among groups? In which way? for more information.
Note 3
A more computationally efficient way is first compute mean and then do sorting on months. In this way, you only need to sort on 12 items rather than the whole df
. It will reduce the computational cost if one don't need df
to be sorted.
Note 4 For people already have month
as index, and wonder how to make it categorical, take a look at pandas.CategoricalIndex @jezrael has a working example on making categorical index ordered in Pandas series sort by month index
You can add the numerical month value together with the name in the index (i.e "01 January"), do a sort then strip off the number:
total=(df.groupby(df['date'].dt.strftime('%m %B'))['price'].mean()).sort_index()
It may look sth like this:
01 January xxx
02 February yyy
03 March zzz
04 April ttt
total.index = [ x.split()[1] for x in total.index ]
January xxx
February yyy
March zzz
April ttt
You can use categorical data to enable proper sorting:
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df['months'] = pd.Categorical(df['months'], categories=months, ordered=True)
df.sort_values(...) # same as you have now; can use inplace=True
When you specify the categories, pandas remembers the order of specification as the default sort order.
Docs: Pandas categories > sorting & order.
I would use the calender
module and reindex
:
series.str.capitalize helps capitalizing the series , then we create a dictionary with the calender
module and map with the series to get month number.
Once we have the month number we can sort_values() and get the index. Then reindex .
import calendar
df.date=df.date.str.capitalize() #capitalizes the series
d={i:e for e,i in enumerate(calendar.month_abbr)} #creates a dictionary
#d={i[:3]:e for e,i in enumerate(calendar.month_name)}
df.reindex(df.date.map(d).sort_values().index) #map + sort_values + reindex with index
date price
2 Apr 13
1 May 15
0 Dec 12
You should consider re-indexing it based on axis 0 (indexes)
new_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
df1 = df.reindex(new_order, axis=0)