I have a big excel file that contains many different sheets. All the sheets have the same structure like:
Name
col1 col2 col3 col4
1 1 2 4
4
Taking a note from this question:
import pandas as pd
file = pd.ExcelFile('file.xlsx')
names = file.sheet_names # see all sheet names
df = pd.concat([file.parse(name) for name in names])
Results:
df
Out[6]:
A B
0 1 3
1 2 4
0 5 6
1 7 8
Then you can run df.reset_index()
, to, well, reset the index.
Edit: pandas.ExcelFile.parse is, according to the pandas docs:
Equivalent to read_excel(ExcelFile, ...) See the read_excel docstring for more info on accepted parameters
First add parameter sheetname=None
for dict
of DataFrames
and skiprows=1
for omit first row and then use concat for MultiIndex DataFrame
.
Last use reset_index for column from first level:
df = pd.concat(pd.read_excel('multiple_sheets.xlsx', sheetname=None, skiprows=1))
df = df.reset_index(level=1, drop=True).rename_axis('filenames').reset_index()
Try this:
dfs = pd.read_excel(filename, sheet_name=None, skiprows=1)
this will return you a dictionary of DFs, which you can easily concatenate using pd.concat(dfs)
or as @jezrael has already posted in his answer:
df = pd.concat(pd.read_excel(filename, sheet_name=None, skiprows=1))
sheet_name: None -> All sheets as a dictionary of DataFrames
UPDATE:
Is there a way to create a variable in the resulting dataframe that identifies the sheet name from which the data comes from?
dfs = pd.read_excel(filename, sheet_name=None, skiprows=1)
assuming we've got the following dict:
In [76]: dfs
Out[76]:
{'d1': col1 col2 col3 col4
0 1 1 2 4
1 4 3 2 1, 'd2': col1 col2 col3 col4
0 3 3 4 6
1 6 5 4 3}
Now we can add a new column:
In [77]: pd.concat([df.assign(name=n) for n,df in dfs.items()])
Out[77]:
col1 col2 col3 col4 name
0 1 1 2 4 d1
1 4 3 2 1 d1
0 3 3 4 6 d2
1 6 5 4 3 d2
file_save_location='myfolder'
file_name='filename'
location = ''myfolder1'
os.chdir(location)
files_xls = glob.glob("*.xls*")
excel_names=[f for f in files_xls]
sheets = pd.ExcelFile(files_xls[0]).sheet_names
def combine_excel_to_dfs(excel_names, sheet_name):
sheet_frames = [pd.read_excel(x, sheet_name=sheet_name) for x in excel_names]
combined_df = pd.concat(sheet_frames).reset_index(drop=True)
return combined_df
i = 0
while i < len(sheets):
process = sheets[i]
consolidated_file= combine_excel_to_dfs(excel_names, process)
consolidated_file.to_csv(file_save_location+file_name+'.csv')
i = i+1
else:
"we done on consolidation part"