I have 10 subdirectories, each containing 20 files, and the file names are the same in every directory. Column 0 is the index column in each file.
e.g
This can be done much more simply in the shell:
find . -name "*.csv" | xargs cat > mergedCSV
(Note: don't give the output file a .csv extension, because find would then match it as well and the result would be inconsistent. Once the command has finished, the file can be renamed to .csv.)
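For anyone who would rather do the same one-shot merge from Python, here is a minimal sketch rather than a drop-in replacement for the command above. The function name `merge_all_csv` and its parameters are made up for illustration. Like `cat`, it simply appends raw file contents, so it assumes the files carry no header rows.

```python
from pathlib import Path
import shutil


def merge_all_csv(root, out_path):
    """Append every *.csv found under `root` to `out_path` (hypothetical helper)."""
    root = Path(root)
    out_path = Path(out_path)
    # Collect the inputs before opening the output, and skip the output file
    # itself, so the merged file is never read back into itself.
    sources = sorted(
        p for p in root.rglob("*.csv")
        if p.is_file() and p.resolve() != out_path.resolve()
    )
    with out_path.open("wb") as out:
        for src in sources:
            with src.open("rb") as f:
                shutil.copyfileobj(f, out)  # raw byte copy, like cat
    return sources
```

Because the output path is excluded explicitly, the .csv-extension caveat from the note does not apply to this version. The paths are sorted only to make the order repeatable; `find` makes no such promise.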
There are many ways to do this; staying in pandas, I did the following.
With the file structure
root/
├── dir1/
│   ├── data_20170101_k
│   ├── data_20170102_k
│   └── ...
├── dir2/
│   ├── data_20170101_k
│   ├── data_20170102_k
│   └── ...
└── ...
This code will work. It's a little verbose for the sake of explanation, but you can shorten it in your own implementation.
import glob
import pandas as pd
CONCAT_DIR = "/FILES_CONCAT/"
# Use the glob module to list every file in the subdirectories of root, then build a DataFrame from the paths.
files = pd.DataFrame(glob.glob("root/*/*"), columns=["fullpath"])
#                         fullpath
# 0  root\dir1\data_20170101_k.csv
# 1  root\dir1\data_20170102_k.csv
# 2  root\dir2\data_20170101_k.csv
# 3  root\dir2\data_20170102_k.csv
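# Split the full path into directory and filename.
# The "\\" separator matches the Windows-style paths shown above; on Linux/macOS split on "/" instead.
# n=1 is passed by keyword because recent pandas versions no longer accept it positionally.
files_split = files['fullpath'].str.rsplit("\\", n=1, expand=True).rename(columns={0: 'path', 1: 'filename'})
#         path             filename
# 0  root\dir1  data_20170101_k.csv
# 1  root\dir1  data_20170102_k.csv
# 2  root\dir2  data_20170101_k.csv
# 3  root\dir2  data_20170102_k.csv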
# Join these into one DataFrame
files = files.join(files_split)
#                         fullpath       path             filename
# 0  root\dir1\data_20170101_k.csv  root\dir1  data_20170101_k.csv
# 1  root\dir1\data_20170102_k.csv  root\dir1  data_20170102_k.csv
# 2  root\dir2\data_20170101_k.csv  root\dir2  data_20170101_k.csv
# 3  root\dir2\data_20170102_k.csv  root\dir2  data_20170102_k.csv
# Iterate over unique filenames; read CSVs, concat DFs, save file
for f in files['filename'].unique():
    # Full paths of every file that shares this filename
    paths = files[files['filename'] == f]['fullpath']
    # Read each CSV into its own DataFrame
    dfs = [pd.read_csv(path, header=None) for path in paths]
    # Concatenate the DataFrames into one
    concat_df = pd.concat(dfs)
    # Save the result (CONCAT_DIR must already exist)
    concat_df.to_csv(CONCAT_DIR + f)
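As a shorter, OS-independent variant of the steps above, the grouping can also be done with `pathlib` and a dictionary instead of an intermediate DataFrame. This is only a sketch: the function name `concat_by_filename` and its parameters are hypothetical. Going by the question's statement that column 0 is the index, it also assumes the files have no header row, so it reads with `index_col=0` and writes with `header=False`, which differs from the loop above.

```python
from collections import defaultdict
from pathlib import Path

import pandas as pd


def concat_by_filename(root, out_dir):
    """Concatenate same-named files from every subdirectory of `root` (hypothetical helper)."""
    root, out_dir = Path(root), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group full paths by bare file name, e.g. "data_20170101_k.csv".
    groups = defaultdict(list)
    for path in sorted(root.glob("*/*")):
        if path.is_file():
            groups[path.name].append(path)

    written = []
    for name, paths in groups.items():
        # Assumption: no header row, and column 0 holds the index.
        dfs = [pd.read_csv(p, header=None, index_col=0) for p in paths]
        target = out_dir / name
        pd.concat(dfs).to_csv(target, header=False)
        written.append(target)
    return written
```

It would be called as `concat_by_filename("root", "FILES_CONCAT")`. Keep the output directory outside `root`, otherwise a second run would pick up the merged files as inputs.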