The data in excel sheets is stored as follows:
Area | Product1 | Product2 | Product3
| sales|sales.Valu
Suppose your DataFrame is df
:
import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame([
(nan, nan, nan, 'Auto loan', nan)
, ('Branch Code', 'Branch Name', 'Region', 'No of accounts'
, 'Portfolio Outstanding')
, (3000, 'Name1', 'Central', 0, 0)
, (3001, 'Name2', 'Central', 0, 0)
])
so that it looks like this:
0 1 2 3 4
0 NaN NaN NaN Auto loan NaN
1 Branch Code Branch Name Region No of accounts Portfolio Outstanding
2 3000 Name1 Central 0 0
3 3001 Name2 Central 0 0
Then first forward fill the NaNs in the first two rows (thus propagating 'Auto loan', for example).
df.iloc[0:2] = df.iloc[0:2].fillna(method='ffill', axis=1)
Next fill in the remaining NaNs with empty strings:
df.iloc[0:2] = df.iloc[0:2].fillna('')
Now join the two rows together with .
and assign that as the column level values:
df.columns = df.iloc[0:2].apply(lambda x: '.'.join([y for y in x if y]), axis=0)
And finally, remove the first two rows:
df = df.iloc[2:]
This yields
Branch Code Branch Name Region Auto loan.No of accounts \
2 3000 Name1 Central 0
3 3001 Name2 Central 0
Auto loan.Portfolio Outstanding
2 0
3 0
Alternatively, you could create a MultiIndex column instead of creating a flat column index:
import numpy as np
import pandas as pd
nan = np.nan
df = pd.DataFrame([
(nan, nan, nan, 'Auto loan', nan)
, ('Branch Code', 'Branch Name', 'Region', 'No of accounts'
, 'Portfolio Outstanding')
, (3000, 'Name1', 'Central', 0, 0)
, (3001, 'Name2', 'Central', 0, 0)
])
df.iloc[0:2] = df.iloc[0:2].fillna(method='ffill', axis=1)
df.iloc[0:2] = df.iloc[0:2].fillna('Area')
df.columns = pd.MultiIndex.from_tuples(
zip(*df.iloc[0:2].to_records(index=False).tolist()))
df = df.iloc[2:]
Now df
looks like this:
Area Auto loan
Branch Code Branch Name Region No of accounts Portfolio Outstanding
2 3000 Name1 Central 0 0
3 3001 Name2 Central 0 0
the column is a MultiIndex:
In [275]: df.columns
Out[275]:
MultiIndex(levels=[[u'Area', u'Auto loan'], [u'Branch Code', u'Branch Name', u'No of accounts', u'Portfolio Outstanding', u'Region']],
labels=[[0, 0, 0, 1, 1], [0, 1, 4, 2, 3]])
The column has two levels. The first level has values [u'Area', u'Auto loan']
, the second has values [u'Branch Code', u'Branch Name', u'No of accounts', u'Portfolio Outstanding', u'Region']
.
You can then access a column by specifing the value from both levels:
print(df.loc[:, ('Area', 'Branch Name')])
# 2 Name1
# 3 Name2
# Name: (Area, Branch Name), dtype: object
print(df.loc[:, ('Auto loan', 'No of accounts')])
# 2 0
# 3 0
# Name: (Auto loan, No of accounts), dtype: object
One advantage of using a MultiIndex is that you can easily select all columns which have a certain level value. For instance, to select the sub-DataFrame having to do with Auto loans
you could use:
In [279]: df.loc[:, 'Auto loan']
Out[279]:
No of accounts Portfolio Outstanding
2 0 0
3 0 0
For more on selecting rows and columns from a MultiIndex, see MultiIndexing Using Slicers.