Pandas dataframes with multi-level columns: rename a specific level of column so that it's the same as another level

Question

Sorry for the seemingly confusing title. I was reading Excel data using Pandas. However, the original Excel data has multiple header rows and some of the cells are merged. It sort of looks like this:

In my Jupyter Notebook it shows up like this:

My plan is to use just the 2nd level as my column names and drop level 0. But the original data has about 15 columns that show as "Unnamed...", so I wonder if I can rename those before dropping the level 0 column names.

The desired output looks like:

I may do this repeatedly so I didn't save it as CSV first and then read it in Pandas. Now I have spent longer than I care to admit on fixing the column names. I wonder if there is a way to do this with a function instead of renaming every individual column of interest.

Thanks.

Answer 1:

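I think the simplest approach here is a list comprehension: for each column of the MultiIndex, take the second-level value unless it contains the text Unnamed, in which case take the first-level value: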

df.columns = [first if 'Unnamed' in second else second for first, second in df.columns]
print (df)
   Purchase/sell_time  Quantity  Price Side
0 2020-04-09 15:22:00        20     43    B
1 2020-04-09 16:22:00        30     56    S
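
Since the question asks for something reusable, the list comprehension above could be wrapped in a small helper. This is only a minimal sketch: the function name `flatten_unnamed` and the sample frame are assumptions made for illustration, and it assumes exactly two column levels.

```python
import pandas as pd

def flatten_unnamed(df):
    """Return a copy of df with two-level columns flattened to one level.

    Hypothetical helper: keeps the second-level label, unless it
    contains 'Unnamed', in which case the first-level label is used.
    """
    out = df.copy()
    out.columns = [
        first if "Unnamed" in str(second) else second
        for first, second in out.columns
    ]
    return out

# Sample frame imitating a two-row Excel header (illustrative data only)
sample = pd.DataFrame(
    [["2020-04-09 15:22:00", 20, 43, "B"]],
    columns=pd.MultiIndex.from_tuples([
        ("Purchase/sell_time", "Unnamed: 0_level_1"),
        ("Purchase/sell_time", "Quantity"),
        ("Purchase/sell_time", "Price"),
        ("Side", "Unnamed: 3_level_1"),
    ]),
)

flat = flatten_unnamed(sample)
print(list(flat.columns))
# ['Purchase/sell_time', 'Quantity', 'Price', 'Side']
```

The helper works on a copy, so the original frame keeps its MultiIndex columns and the function can be applied to each newly read sheet.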

But if the real data has more levels, some of the resulting column names may be duplicated, and then a single column can no longer be selected by name: selecting a duplicated name, e.g. df['dup_column_name'], returns all columns with that name, not only one.

You can test it:

print (df.columns[df.columns.duplicated(keep=False)])
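
To see why duplicated names matter, here is a small sketch with a made-up frame (the name `dup` and its values are assumptions for illustration) in which two flattened columns ended up with the same name:

```python
import pandas as pd

# Illustrative frame whose flattened names collide: two columns named "Price"
dup = pd.DataFrame([[1, 2, 3]], columns=["Price", "Quantity", "Price"])

# Names that occur more than once (keep=False marks every occurrence)
dup_names = dup.columns[dup.columns.duplicated(keep=False)]
print(list(dup_names))
# ['Price', 'Price']

# Selecting a duplicated name returns a DataFrame holding both columns,
# while selecting a unique name returns a Series
print(type(dup["Price"]).__name__, dup["Price"].shape)
# DataFrame (1, 2)
print(type(dup["Quantity"]).__name__)
# Series
```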

To prevent this, I suggest joining all levels that are not Unnamed:

df.columns = ['_'.join(y for y in x if 'Unnamed' not in y) for x in df.columns]
print (df)
   Purchase/sell_time  Purchase/sell_time_Quantity  Purchase/sell_time_Price  \
0 2020-04-09 15:22:00                           20                        43   
1 2020-04-09 16:22:00                           30                        56   

  Side  
0    B  
1    S
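
The same join also works with more than two levels. The sketch below uses a hypothetical three-level header (the labels `Trade`, `Buy`, `Sell` and the name `deep` are invented for illustration) where keeping only the last level would give two columns named `Price`:

```python
import pandas as pd

# Hypothetical three-level header; the last level alone would repeat "Price"
cols = pd.MultiIndex.from_tuples([
    ("Trade", "Buy", "Price"),
    ("Trade", "Sell", "Price"),
    ("Side", "Unnamed: 2_level_1", "Unnamed: 2_level_2"),
])
deep = pd.DataFrame([[43, 56, "B"]], columns=cols)

# Join every label that does not contain 'Unnamed'; str() guards against
# non-string labels such as numbers
deep.columns = [
    "_".join(str(part) for part in labels if "Unnamed" not in str(part))
    for labels in deep.columns
]
print(list(deep.columns))
# ['Trade_Buy_Price', 'Trade_Sell_Price', 'Side']
```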

Answer 2:

Your columns are a MultiIndex, and an index is immutable, meaning you can't change only part of it in place. This is why I suggest retrieving both levels of the MultiIndex, building an array with your desired column names, and replacing the DataFrame's columns with it, as follows:

import numpy as np
import pandas as pd

# First I reproduce your dataframe
df1 = pd.DataFrame({
    # explicit timestamps, so no frequency alias such as "H" is needed
    ("Purchase/sell_time", "Unnamed:"): pd.to_datetime(
        ["2020-04-09 15:22:00", "2020-04-09 16:22:00"]),
    ("Purchase/sell_time", "Quantity"): [20, 30],
    ("Purchase/sell_time", "Price"): [43, 56],
    ("Side", "Unnamed:"): ["B", "S"]})

It looks like this:

 Purchase/sell_time                    Side
             Unnamed: Quantity Price Unnamed:
0 2020-04-09 15:22:00       20    43        B
1 2020-04-09 16:22:00       30    56        S

The columns are a MultiIndex, as you can see:

MultiIndex([('Purchase/sell_time', 'Unnamed:'),
            ('Purchase/sell_time', 'Quantity'),
            ('Purchase/sell_time',    'Price'),
            (              'Side', 'Unnamed:')],
           )

# I retrieve the first and second levels of the MultiIndex, then build an array that takes
# the first level wherever the second level starts with "Unnamed:", and the second level otherwise
first_header = df1.columns.get_level_values(0)
second_header = df1.columns.get_level_values(1)
merge_header = np.where(second_header.str.startswith("Unnamed:"),
                        first_header, second_header)
df1.columns = merge_header

Here is the result:

 Purchase/sell_time  Quantity  Price Side
0 2020-04-09 15:22:00        20     43    B
1 2020-04-09 16:22:00        30     56    S
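
The immutability mentioned at the start of this answer can be checked with a short sketch. The frame below and the names `frame` and `in_place_allowed` are assumptions for illustration:

```python
import pandas as pd

cols = pd.MultiIndex.from_tuples([
    ("Purchase/sell_time", "Unnamed:"),
    ("Purchase/sell_time", "Quantity"),
])
frame = pd.DataFrame([[1, 20]], columns=cols)

# Changing a single label in place is rejected: Index objects are immutable
try:
    frame.columns[0] = ("Purchase/sell_time", "Time")
    in_place_allowed = True
except TypeError:
    in_place_allowed = False
print(in_place_allowed)
# False

# Assigning a complete new set of labels is allowed
frame.columns = ["Purchase/sell_time", "Quantity"]
print(list(frame.columns))
# ['Purchase/sell_time', 'Quantity']
```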

Hope it helps

Source: https://stackoverflow.com/questions/61111336/pandas-dataframes-with-multi-level-columnsrename-a-specific-level-of-column-so

Tags

python

pandas

rename