How to concatenate multiple csv to xarray and define coordinates?

后端 未结 2 1382
攒了一身酷
攒了一身酷 2020-12-22 12:30

I have multiple csv-files, with the same rows and columns and their contained data varies depending on the date. Each csv-file is affiliated with a different date, listed in

相关标签:
2条回答
  • 2020-12-22 12:55

    Recall that although it introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, xarray is inspired by and borrows heavily from pandas. So, to answer the question you can proceed as follows.

    from glob import glob
    import numpy as np
    import pandas as pd
    
    # Get the list of all the csv files in data path
    csv_flist = glob(data_path + "/*.csv") 
    
    df_list = []
    for _file in csv_flist:
        # get the file name from the data path
        file_name = _file.split("/")[-1]
        
        # extract the date from a file name, e.g. "data.2018-06-01.csv"
        date = file_name.split(".")[1]
        
        # read the read the data in _file
        df = pd.read_csv(_file)
        
        # add a column date knowing that all the data in df are recorded at the same date
        df["date"] = np.repeat(date, df.shape[0])
        df["date"] = df.date.astype("datetime64[ns]") # reset date column to a correct date format
        
        # append df to df_list
        df_list.append(df)
    

    Let's check e.g. the first df in df_list

    print(df_list[0])
    
        status  user_id  weight       date
    0  healthy        1      72 2019-06-01
    1    obese        2     103 2019-06-01
    

    Concatenate all the dfs along axis=0

    df_all = pd.concat(df_list, ignore_index=True).sort_index()
    print(df_all)
    
        status  user_id  weight       date
    0  healthy        1      72 2019-06-01
    1    obese        2     103 2019-06-01
    2  healthy        1      70 2018-06-01
    3  healthy        2      90 2018-06-01
    

    Set the index of df_all to a multiIndex of two levels with levels[0] = "date" and levels[1]="user_id".

    data = df_all.set_index(["date", "user_id"]).sort_index()
    print(data)
    
                         status  weight
    date       user_id                 
    2018-06-01 1        healthy      70
               2        healthy      90
    2019-06-01 1        healthy      72
               2          obese     103
    

    Subsequently, you can convert the resulting pandas.DataFrame into an xarray.Dataset using .to_xarray() as follows.

    xds = data.to_xarray()
    print(xds)
    
    <xarray.Dataset>
    Dimensions:  (date: 2, user_id: 2)
    Coordinates:
      * date     (date) datetime64[ns] 2018-06-01 2019-06-01
      * user_id  (user_id) int64 1 2
    Data variables:
        status   (date, user_id) object 'healthy' 'healthy' 'healthy' 'obese'
        weight   (date, user_id) int64 70 90 72 103
    

    Which will fully answer the question.

    0 讨论(0)
  • 2020-12-22 13:15

    Try these:

        import glob
        import pandas as pd
    
        path=(r'ur file')
        all_file = glob.glob(path + "/*.csv")
        li = []
        for filename in all_file:
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)
        frame = pd.concat(li, axis=0, ignore_index=True)
    
    0 讨论(0)
提交回复
热议问题