How to concatenate multiple csv to xarray and define coordinates?


I have multiple csv-files, with the same rows and columns and their contained data varies depending on the date. Each csv-file is affiliated with a different date, listed in


2 Answers

面向向阳花

2020-12-22 12:55

xarray adds labels (dimensions, coordinates and attributes) on top of raw NumPy-like arrays, but it is inspired by and borrows heavily from pandas. So one way to answer the question is to build a pandas.DataFrame first and then convert it, as follows.

import os
from glob import glob

import pandas as pd

# Directory containing the csv files (assumed placeholder, adjust to your setup)
data_path = "."

# Get the list of all the csv files in data_path
csv_flist = glob(os.path.join(data_path, "*.csv"))

df_list = []
for _file in csv_flist:
    # get the file name from the full path (works with any path separator)
    file_name = os.path.basename(_file)

    # extract the date from the file name, e.g. "data.2018-06-01.csv"
    date = file_name.split(".")[1]

    # read the data in _file
    df = pd.read_csv(_file)

    # add a date column; all rows in df were recorded on the same date,
    # so the parsed timestamp is broadcast to every row
    df["date"] = pd.to_datetime(date)

    # append df to df_list
    df_list.append(df)
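
The file-name parsing step can also be isolated into a small helper and checked on its own. This is only a sketch: it assumes file names of the form "data.2018-06-01.csv" as in the comment above, the helper name date_from_filename is hypothetical, and it uses only the standard library.

```python
from datetime import date
from pathlib import PurePath


def date_from_filename(path):
    """Return the date embedded in a name like 'data.2018-06-01.csv'.

    Hypothetical helper: assumes the date is the second dot-separated
    field of the file name and is written as YYYY-MM-DD.
    """
    file_name = PurePath(path).name
    fields = file_name.split(".")
    if len(fields) < 3:
        raise ValueError(f"unexpected file name: {file_name!r}")
    # date.fromisoformat raises ValueError on a malformed date string
    return date.fromisoformat(fields[1])


print(date_from_filename("some/dir/data.2018-06-01.csv"))
```

Failing early on an unexpected name is a design choice here: a silently wrong date would end up as a wrong coordinate later on.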

Let's check, for example, the first df in df_list:

print(df_list[0])

    status  user_id  weight       date
0  healthy        1      72 2019-06-01
1    obese        2     103 2019-06-01

Concatenate all the DataFrames along axis=0 (ignore_index=True already yields a fresh 0..n-1 index, so no extra sort is needed):

df_all = pd.concat(df_list, ignore_index=True)
print(df_all)

    status  user_id  weight       date
0  healthy        1      72 2019-06-01
1    obese        2     103 2019-06-01
2  healthy        1      70 2018-06-01
3  healthy        2      90 2018-06-01

Set the index of df_all to a two-level MultiIndex, with "date" as the first level and "user_id" as the second.

data = df_all.set_index(["date", "user_id"]).sort_index()
print(data)

                     status  weight
date       user_id                 
2018-06-01 1        healthy      70
           2        healthy      90
2019-06-01 1        healthy      72
           2          obese     103

Subsequently, you can convert the resulting pandas.DataFrame into an xarray.Dataset using .to_xarray() as follows.

xds = data.to_xarray()
print(xds)

<xarray.Dataset>
Dimensions:  (date: 2, user_id: 2)
Coordinates:
  * date     (date) datetime64[ns] 2018-06-01 2019-06-01
  * user_id  (user_id) int64 1 2
Data variables:
    status   (date, user_id) object 'healthy' 'healthy' 'healthy' 'obese'
    weight   (date, user_id) int64 70 90 72 103

The result is an xarray.Dataset with date and user_id as coordinates, which is what the question asks for.
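
Once date and user_id are coordinates, values can be looked up by label. A minimal sketch, assuming pandas and xarray are installed; the four example rows shown earlier are rebuilt inline so the snippet runs on its own, and the variable names are only illustrative.

```python
import pandas as pd

# Rebuild the example rows inline (values copied from the printed output)
df_all = pd.DataFrame(
    {
        "status": ["healthy", "obese", "healthy", "healthy"],
        "user_id": [1, 2, 1, 2],
        "weight": [72, 103, 70, 90],
        "date": pd.to_datetime(
            ["2019-06-01", "2019-06-01", "2018-06-01", "2018-06-01"]
        ),
    }
)

# to_xarray() requires xarray; each MultiIndex level becomes a dimension
xds = df_all.set_index(["date", "user_id"]).sort_index().to_xarray()

# Label-based selection on both coordinates gives a single value
weight_2018_user2 = (
    xds["weight"].sel(date=pd.Timestamp("2018-06-01"), user_id=2).item()
)
print(weight_2018_user2)

# Selecting on one coordinate keeps the other as a dimension
weights_2019 = xds["weight"].sel(date=pd.Timestamp("2019-06-01")).values.tolist()
print(weights_2019)
```

Note that to_xarray() expects the (date, user_id) pairs to be unique, so duplicated rows would need to be resolved before the conversion.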


一整个雨季

2020-12-22 13:15

Try this:

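    import glob
    import pandas as pd

    path = r'ur file'  # folder containing the csv files
    all_files = glob.glob(path + "/*.csv")
    li = []
    for filename in all_files:
        # read each csv file into its own DataFrame
        df = pd.read_csv(filename, index_col=None, header=0)
        li.append(df)
    # stack all DataFrames into a single one
    frame = pd.concat(li, axis=0, ignore_index=True)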
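
A plain concatenation like the one above loses track of which file each row came from. One possible way to keep the date, sketched here under the assumption that each file's date is already known, is to pass a dict of DataFrames to pd.concat so that the dict keys become an outer index level. The frames dict and its values are illustrative stand-ins for the data read from the csv files.

```python
import pandas as pd

# Illustrative stand-ins for the per-file DataFrames, keyed by their date
frames = {
    pd.Timestamp("2019-06-01"): pd.DataFrame(
        {"user_id": [1, 2], "weight": [72, 103]}
    ),
    pd.Timestamp("2018-06-01"): pd.DataFrame(
        {"user_id": [1, 2], "weight": [70, 90]}
    ),
}

# Index every frame by user_id, then let the dict keys form the outer
# "date" level of the resulting MultiIndex
combined = pd.concat(
    {day: df.set_index("user_id") for day, df in frames.items()},
    names=["date", "user_id"],
).sort_index()

print(combined)
```

The resulting frame has a (date, user_id) MultiIndex, so it can be converted with .to_xarray() when xarray is installed.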
