I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame, but I have not been able to figure it out.
The Dask library can read a dataframe from multiple files:
>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')
(Source: http://dask.pydata.org/en/latest/examples/dataframe-csv.html)
Dask dataframes implement a subset of the pandas DataFrame API. If all the data fits into memory, you can call df.compute() to convert the result into a pandas DataFrame.
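For instance, a minimal sketch (assuming the same data*.csv pattern as above) that reads everything and materializes a single pandas DataFrame:
import dask.dataframe as dd

# read all matching CSVs lazily, then convert the result to an in-memory pandas DataFrame
df = dd.read_csv('data*.csv').compute()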
If the CSV files are zipped, you may use zipfile to read them all and concatenate as below:
import zipfile
import numpy as np
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')
train = []
for f in range(len(ziptrain.namelist())):
    if f == 0:
        # first archive member: read it directly
        train = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
    else:
        # remaining members: read each one and stack it onto what we have so far
        my_df = pd.read_csv(ziptrain.open(ziptrain.namelist()[f]))
        train = pd.DataFrame(np.concatenate((train, my_df), axis=0),
                             columns=list(my_df.columns.values))
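A shorter equivalent, if you prefer pd.concat over the manual NumPy concatenation (same placeholder zip path as above):
import zipfile
import pandas as pd

with zipfile.ZipFile('yourpath/yourfile.zip') as z:
    # read every member of the archive and stack the frames, resetting the index
    train = pd.concat((pd.read_csv(z.open(name)) for name in z.namelist()),
                      ignore_index=True)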
Import two or more CSVs without having to make a list of names:
import glob
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob('data/*.csv')))
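Note that glob does not guarantee any particular file order; if the row order of the result matters, sorting the matched paths first is a small, safe addition:
import glob
import pandas as pd

# sort the matched paths so the concatenation order is deterministic
df = pd.concat(map(pd.read_csv, sorted(glob.glob('data/*.csv'))))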
Another one-liner, using a list comprehension, which lets you pass arguments to read_csv:
import os
import pandas as pd

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])
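For example, to actually forward options to read_csv (the separator and encoding below are just placeholders), the same pattern becomes:
import os
import pandas as pd

# same comprehension, this time passing read_csv options (placeholder values)
df = pd.concat([pd.read_csv(f'dir/{f}', sep=';', encoding='latin-1')
                for f in os.listdir('dir') if f.endswith('.csv')],
               ignore_index=True)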
An alternative to darindaCoder's answer:
import glob
import os
import pandas as pd

path = r'C:\DRO\DCL_rawdata_files'                    # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))    # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
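If you're on Python 3.4+, a pathlib sketch of the same idea (same placeholder path) avoids the explicit os.path.join:
from pathlib import Path
import pandas as pd

path = Path(r'C:\DRO\DCL_rawdata_files')              # use your path
concatenated_df = pd.concat((pd.read_csv(f) for f in path.glob('*.csv')),
                            ignore_index=True)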
A one-liner using map, but if you'd like to specify additional arguments, you could do:
import pandas as pd
import glob
import functools
df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
                   glob.glob("data/*.csv")))
Note: map by itself does not let you supply additional arguments.
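The same effect without functools, using a lambda (keeping the '|' separator from the example above):
import glob
import pandas as pd

df = pd.concat(map(lambda f: pd.read_csv(f, sep='|', compression=None),
                   glob.glob("data/*.csv")))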