Question
I have several csv files in a single folder and I want to open them all in one dataframe and insert a new column with the associated filename. So far I've coded the following:
import pandas as pd
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('path/*.csv'))))
df['filename']= os.path.basename(csv)
df
This gives me the dataframe I want, but the new 'filename' column lists only the last filename in the folder for every row. I want each row to be populated with its associated csv file, not just the last file in the folder.
Any assistance for this newbie is much appreciated.
Answer 1:
I think you need assign to add the new column to each file's dataframe inside a loop. The parameter ignore_index=True was also added to concat to avoid duplicate values in the index:
The test files are a.csv, b.csv and c.csv.
import pandas as pd
import glob, os
files = glob.glob('files/*.csv')
print (files)
['files\\a.csv', 'files\\b.csv', 'files\\c.csv']
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files], ignore_index=True)
print (df)
a b c d New
0 0 1 2 5 a.csv
1 1 5 8 3 a.csv
2 0 9 6 5 b.csv
3 1 6 4 2 b.csv
4 0 7 1 7 c.csv
5 1 3 2 6 c.csv
# same idea, but keep only the part of the file name before the first dot
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp).split('.')[0]) for fp in files], ignore_index=True)
print (df)
a b c d New
0 0 1 2 5 a
1 1 5 8 3 a
2 0 9 6 5 b
3 1 6 4 2 b
4 0 7 1 7 c
5 1 3 2 6 c
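The approach in this answer can be wrapped in a small helper. The following is only a sketch under stated assumptions: the function name read_csvs_with_filename, its parameters, and the temporary test files are hypothetical and not part of the original answer. It uses os.path.splitext rather than split('.') so that file names containing extra dots keep everything before the final extension.

```python
import glob
import os
import tempfile

import pandas as pd


def read_csvs_with_filename(folder, column="filename", keep_extension=True):
    """Concatenate every CSV in `folder`, tagging each row with its source file.

    Hypothetical helper for illustration; pd.concat raises ValueError
    if the folder contains no CSV files.
    """
    frames = []
    # sorted() makes the row order independent of the order glob returns
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        label = os.path.basename(path)
        if not keep_extension:
            # splitext only strips the final extension, e.g. "a.b.csv" -> "a.b"
            label = os.path.splitext(label)[0]
        # assign() adds the label to this file's rows before concatenation
        frames.append(pd.read_csv(path).assign(**{column: label}))
    return pd.concat(frames, ignore_index=True)


# Small demonstration with two throwaway files (made-up data)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"a": [0, 1], "b": [1, 5]}).to_csv(os.path.join(tmp, "a.csv"), index=False)
    pd.DataFrame({"a": [0, 1], "b": [9, 6]}).to_csv(os.path.join(tmp, "b.csv"), index=False)
    df = read_csvs_with_filename(tmp)
    df_stem = read_csvs_with_filename(tmp, keep_extension=False)

print(df)
```

Because the label is attached while each file is still a separate dataframe, every row keeps the name of the file it came from.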
Answer 2:
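First, the code you posted never defines a csv variable.
Even so, the behaviour makes sense: df['filename'] is assigned once, after all the files have already been concatenated, so the single value that csv holds at that point (the last file) is applied to every row. Instead, you can use glob again to get all the filenames and then build the new column from them.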
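# this is a Python list containing the file paths
csvs = glob.glob(os.path.join('path', '*.csv'))
# read each file once so we know how many rows it contributes
frames = [pd.read_csv(f) for f in csvs]
df = pd.concat(frames, ignore_index=True)
# repeat each path once per row of its file so the length matches df
csv_paths = pd.Series(csvs).repeat([len(f) for f in frames])
df['file_name'] = csv_paths.values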
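To illustrate the reasoning about why every row ended up with the same name, here is a minimal sketch with made-up data (the column values and file names are assumptions for illustration only). Assigning a single string to a column broadcasts it to all rows, while assigning a sequence only works when its length equals the number of rows.

```python
import pandas as pd

# Stand-in for an already concatenated dataframe (made-up values)
df = pd.DataFrame({"a": [0, 1, 0, 1]})

# A single string is broadcast to every row, which is why one
# file name shows up everywhere in the question's result
df["filename"] = "c.csv"
unique_names = df["filename"].unique().tolist()
print(unique_names)

# A list with one entry per file (2) does not match the row count (4),
# so pandas refuses the assignment
try:
    df["filename"] = ["a.csv", "b.csv"]
    length_error = None
except ValueError as exc:
    length_error = exc
print(type(length_error).__name__)
```

So the list of filenames has to be expanded to one entry per row, or the name has to be attached to each file's rows before concatenating.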
Source: https://stackoverflow.com/questions/42756696/read-multiple-csv-files-and-add-filename-as-new-column-in-pandas