I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame, but I have not been able to figure out how.
Based on @Sid's good answer.
Before concatenating, you can load the CSV files into an intermediate dictionary that gives access to each data set by file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, for example when column names are not aligned across files.
import os
import glob
import pandas
from collections import OrderedDict
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
Note: OrderedDict is not necessary (on Python 3.7+ a plain dict also preserves insertion order), but it keeps the files in the order they were globbed, which can be useful for analysis.
dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
pandas.concat(dict_of_df, sort=True)
The keys are the file names f and the values are the DataFrames read from the CSV files.
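For instance, the dictionary makes it easy to check whether the column names are aligned before concatenating. A minimal sketch, assuming the dict_of_df built above (the check itself is only an illustration, not part of the original answer):

# Collect the union of all column names seen across the files
all_columns = set()
for df in dict_of_df.values():
    all_columns.update(df.columns)

# Report any file whose columns do not cover the full set
for name, df in dict_of_df.items():
    missing = all_columns - set(df.columns)
    if missing:
        print(name, "is missing columns:", sorted(missing))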
Instead of using f as the dictionary key, you can also use os.path.basename(f) or other os.path functions so that the key contains only the part of the path that is relevant.
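A minimal sketch of that variant, assuming the same filenames list as above (big_df is just an illustrative name):

# Key each DataFrame by its base file name instead of the full path
dict_of_df = OrderedDict((os.path.basename(f), pandas.read_csv(f)) for f in filenames)

# The outer level of the concatenated index is now the short file name
big_df = pandas.concat(dict_of_df, sort=True)
print(big_df.index.get_level_values(0).unique())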