Import multiple csv files into pandas and concatenate into one DataFrame

既然无缘 2020-11-21 07:47

I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out though. Here is what I…

16 answers
  • 2020-11-21 07:48

    If you have the same columns in all your CSV files, you can try the code below. I have added header=0 so that after reading each CSV, the first row is assigned as the column names.

    import pandas as pd
    import glob
    
    path = r'C:\DRO\DCL_rawdata_files' # use your path
    all_files = glob.glob(path + "/*.csv")
    
    li = []
    
    for filename in all_files:
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)
    
    frame = pd.concat(li, axis=0, ignore_index=True)
    
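    As a side note (an alternative sketch, not part of the original answer), the same pattern can also be written with pathlib, which handles path joining across platforms; this assumes the same directory layout as above:

    import pandas as pd
    from pathlib import Path
    
    path = Path(r'C:\DRO\DCL_rawdata_files')  # use your path
    frame = pd.concat(
        (pd.read_csv(csv_file, index_col=None, header=0) for csv_file in path.glob("*.csv")),
        ignore_index=True,
    )
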
  • 2020-11-21 07:48

    You can do it this way also:

    import pandas as pd
    import os
    
    csv_folder_path = r'C:\DRO\DCL_rawdata_files'  # use your path
    
    frames = []
    for r, d, f in os.walk(csv_folder_path):
        for file in f:
            if file.endswith(".csv"):
                complete_file_path = os.path.join(r, file)  # join the walked dirpath and the filename
                read_file = pd.read_csv(complete_file_path)
                frames.append(read_file)
    
    # DataFrame.append was removed in pandas 2.0; collect the frames and concat once
    new_df = pd.concat(frames, ignore_index=True)
    
    new_df.shape
    
  • 2020-11-21 07:55

    Edit: I googled my way into https://stackoverflow.com/a/21232849/186078. However, lately I have been finding it faster to do any manipulation with numpy and then assign it to a dataframe once, rather than manipulating the dataframe itself iteratively, and that seems to work in this solution too.

    I do sincerely want anyone hitting this page to consider this approach, but I don't want to attach this huge piece of code as a comment and make it less readable.

    You can leverage numpy to really speed up the dataframe concatenation.

    import os
    import glob
    import pandas as pd
    import numpy as np
    
    path = "my_dir_full_path"
    allFiles = glob.glob(os.path.join(path,"*.csv"))
    
    
    np_array_list = []
    for file_ in allFiles:
        df = pd.read_csv(file_,index_col=None, header=0)
        np_array_list.append(df.to_numpy())  # .as_matrix() was removed from pandas; .to_numpy() is the replacement
    
    comb_np_array = np.vstack(np_array_list)
    big_frame = pd.DataFrame(comb_np_array)
    
    big_frame.columns = ["col1", "col2", ...]  # replace the ellipsis with the rest of your column names
    

    Timing stats:

    total files :192
    avg lines per file :8492
    --approach 1 without numpy -- 8.248656988143921 seconds ---
    total records old :1630571
    --approach 2 with numpy -- 2.289292573928833 seconds ---
    
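    The answer does not show the timing harness itself; below is a minimal sketch of one way such a comparison could be run (illustrative only, with a placeholder path, not the original benchmark):

    import glob
    import os
    import time
    
    import numpy as np
    import pandas as pd
    
    path = "my_dir_full_path"  # placeholder, use your path
    allFiles = glob.glob(os.path.join(path, "*.csv"))
    
    # approach 1: read each file and concatenate the DataFrames directly
    start = time.time()
    frame = pd.concat(
        (pd.read_csv(f, index_col=None, header=0) for f in allFiles),
        ignore_index=True,
    )
    print("--approach 1 without numpy -- %s seconds ---" % (time.time() - start))
    
    # approach 2: stack the underlying numpy arrays, then build one DataFrame
    start = time.time()
    np_arrays = [pd.read_csv(f, index_col=None, header=0).to_numpy() for f in allFiles]
    big_frame = pd.DataFrame(np.vstack(np_arrays))
    print("--approach 2 with numpy -- %s seconds ---" % (time.time() - start))
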
  • 2020-11-21 07:59
    import pandas as pd
    import glob
    
    path = r'C:\DRO\DCL_rawdata_files' # use your path
    file_path_list = glob.glob(path + "/*.csv")
    
    file_iter = iter(file_path_list)
    
    list_df_csv = []
    list_df_csv.append(pd.read_csv(next(file_iter)))
    
    for file in file_iter:
        list_df_csv.append(pd.read_csv(file, header=0))
    df = pd.concat(list_df_csv, ignore_index=True)
    
  • 2020-11-21 08:00

    If you want to search recursively (Python 3.5 or above), you can do the following:

    from glob import iglob
    import pandas as pd
    
    path = r'C:\user\your\path\**\*.csv'
    
    all_rec = iglob(path, recursive=True)     
    dataframes = (pd.read_csv(f) for f in all_rec)
    big_dataframe = pd.concat(dataframes, ignore_index=True)
    

    Note that the last three lines can be expressed in a single line:

    df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)
    

    You can find the documentation of ** in the Python glob module documentation. Also, I used iglob instead of glob, as it returns an iterator instead of a list.



    EDIT: Multiplatform recursive function:

    You can wrap the above into a multiplatform function (Linux, Windows, Mac), so you can do:

    df = read_df_rec(r'C:\user\your\path', '*.csv')
    

    Here is the function:

    from glob import iglob
    from os.path import join
    import pandas as pd
    
    def read_df_rec(path, fn_regex=r'*.csv'):
        return pd.concat((pd.read_csv(f) for f in iglob(
            join(path, '**', fn_regex), recursive=True)), ignore_index=True)
    
  • 2020-11-21 08:02

    Based on @Sid's good answer.

    Before concatenating, you can load the csv files into an intermediate dictionary which gives access to each data set based on the file name (in the form dict_of_df['filename.csv']). Such a dictionary can help you identify issues with heterogeneous data formats, for example when column names are not aligned.

    Import modules and locate file paths:

    import os
    import glob
    import pandas
    from collections import OrderedDict
    path = r'C:\DRO\DCL_rawdata_files'
    filenames = glob.glob(path + "/*.csv")
    

    Note: OrderedDict is not necessary, but it'll keep the order of files which might be useful for analysis.

    Load csv files into a dictionary. Then concatenate:

    dict_of_df = OrderedDict((f, pandas.read_csv(f)) for f in filenames)
    pandas.concat(dict_of_df, sort=True)
    

    Keys are the file names f and values are the DataFrame contents of the csv files. Instead of using f as a dictionary key, you can also use os.path.basename(f) or other os.path methods to reduce the dictionary key to only the smaller part that is relevant.
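
    A minimal sketch of that basename variant, reusing the names from the code above:

    dict_of_df = OrderedDict(
        (os.path.basename(f), pandas.read_csv(f)) for f in filenames
    )
    big_frame = pandas.concat(dict_of_df, sort=True)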
