Question
Sorry for the seemingly confusing title. I was reading Excel data using Pandas. However, the original Excel data has multiple rows for header and some of the cells are merged. It sort of looks like this:
It shows in my Jupyter Notebook like this
My plan is to use just the 2nd level as my column names and drop level 0. But the original data has about 15 columns that show up as "Unnamed...", so I wonder if I can rename those before dropping the level 0 column names.
The desirable output looks like:
I may do this repeatedly so I didn't save it as CSV first and then read it in Pandas. Now I have spent longer than I care to admit on fixing the column names. I wonder if there is a way to do this with a function instead of renaming every individual column of interest.
Thanks.
Answer 1:
I think the simplest approach here is a list comprehension: take the second-level value of each MultiIndex entry, and fall back to the first level only when the second contains the text Unnamed:
df.columns = [first if 'Unnamed' in second else second for first, second in df.columns]
print (df)
Purchase/sell_time Quantity Price Side
0 2020-04-09 15:22:00 20 43 B
1 2020-04-09 16:22:00 30 56 S
But if the real data can have more levels, some of the resulting column names may be duplicated, and then you cannot select a single column (selecting by a duplicated name, e.g. df['dup_column_name'], returns all matching columns, not only one).
You can check for duplicates like this:
print (df.columns[df.columns.duplicated(keep=False)])
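To make the duplication problem concrete, here is a minimal sketch. The three-level header and the "Buy"/"Sell" labels are made up for illustration and are not from the question's data; the point is only that keeping the last level can produce two columns with the same name.

```python
import pandas as pd

# Hypothetical three-level header in which two columns end in the same label.
cols = pd.MultiIndex.from_tuples([
    ("Buy", "Unnamed: 0_level_1", "Price"),
    ("Sell", "Unnamed: 1_level_1", "Price"),
    ("Side", "Unnamed: 2_level_1", "Unnamed: 2_level_2"),
])
df = pd.DataFrame([[43, 44, "B"], [56, 57, "S"]], columns=cols)

# Keep only the last level unless it is unnamed, so "Price" appears twice.
df.columns = [x[0] if "Unnamed" in x[-1] else x[-1] for x in df.columns]

dupes = df.columns[df.columns.duplicated(keep=False)]
print(list(dupes))        # the duplicated labels
print(type(df["Price"]))  # a DataFrame with both columns, not a Series
```

If the printed list is empty for your data, the simple list comprehension is safe to use.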
In that case I suggest joining all levels that are not unnamed, which prevents duplicates:
df.columns = ['_'.join(y for y in x if 'Unnamed' not in y) for x in df.columns]
print (df)
Purchase/sell_time Purchase/sell_time_Quantity Purchase/sell_time_Price \
0 2020-04-09 15:22:00 20 43
1 2020-04-09 16:22:00 30 56
Side
0 B
1 S
Answer 2:
Your columns are a MultiIndex, and indexes are immutable, meaning you can't change only part of them. This is why I suggest retrieving both levels of the MultiIndex, building an array with your desired column names, and replacing the DataFrame's columns with it, as follows:
# First I reproduce your dataframe
import numpy as np
import pandas as pd

df1 = pd.DataFrame({("Purchase/sell_time", "Unnamed:"): pd.date_range("2020-04-09 15:22:00",
                                                                      freq="h", periods=2),
                    ("Purchase/sell_time", "Quantity"): [20, 30],
                    ("Purchase/sell_time", "Price"): [43, 56],
                    ("Side", "Unnamed:"): ["B", "S"]})
df1 = df1.sort_index()
It looks like this:
Purchase/sell_time Side
Unnamed: Quantity Price Unnamed:
0 2020-04-09 15:22:00 20 43 B
1 2020-04-09 16:22:00 30 56 S
The columns are a MultiIndex, as you can see:
MultiIndex([('Purchase/sell_time', 'Unnamed:'),
('Purchase/sell_time', 'Quantity'),
('Purchase/sell_time', 'Price'),
( 'Side', 'Unnamed:')],
)
# I retrieve the first and second levels of the MultiIndex, then build an array that takes
# the first level wherever the second level starts with "Unnamed:" and the second level otherwise
first_header = df1.columns.get_level_values(0)
second_header = df1.columns.get_level_values(1)
merge_header = np.where(second_header.str.startswith("Unnamed:"),
first_header, second_header)
df1.columns = merge_header
Here is the result:
Purchase/sell_time Quantity Price Side
0 2020-04-09 15:22:00 20 43 B
1 2020-04-09 16:22:00 30 56 S
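The immutability point can be checked with a short sketch. The frame below is a cut-down stand-in for df1, and the replacement label "Time" is made up for illustration.

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([
    ("Purchase/sell_time", "Unnamed:"),
    ("Purchase/sell_time", "Quantity"),
    ("Side", "Unnamed:"),
])
df = pd.DataFrame([[1, 20, "B"]], columns=cols)

# Item assignment on an Index or MultiIndex is rejected with a TypeError.
try:
    df.columns[0] = ("Purchase/sell_time", "Time")
    mutated = True
except TypeError as exc:
    mutated = False
    print("TypeError:", exc)

# Replacing the whole columns object at once is allowed.
df.columns = ["Purchase/sell_time", "Quantity", "Side"]
print(list(df.columns))
```

This is why the approach above assigns a complete new array to df1.columns rather than editing individual entries.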
Hope it helps
Source: https://stackoverflow.com/questions/61111336/pandas-dataframes-with-multi-level-columnsrename-a-specific-level-of-column-so