Get hourly average for each month from a netcdf file

Question

I have a netCDF file with the time dimension containing data by the hour for 2 years. I want to average it to get an hourly average for each hour of the day for each month. I tried this:

import xarray as xr
ds = xr.open_mfdataset('ecmwf_usa_2015.nc')    
ds.groupby(['time.month', 'time.hour']).mean('time')

but I get this error:

*** TypeError: `group` must be an xarray.DataArray or the name of an xarray variable or dimension

How can I fix this? If I do this:

ds.groupby('time.month', 'time.hour').mean('time')

I do not get an error but the result has a time dimension of 12 (one value for each month), whereas I want an hourly average for each month i.e. 24 values for each of 12 months. Data is available here: https://www.dropbox.com/s/yqgg80wn8bjdksy/ecmwf_usa_2015.nc?dl=0

Answer 1:

You are getting TypeError: `group` must be an xarray.DataArray or the name of an xarray variable or dimension because ds.groupby() expects a single xarray DataArray or the name of one variable or dimension, whereas you passed a list of names.

You have two options:

1. xarray bins --> group by hour

Refer to the xarray groupby documentation: split the dataset into per-month pieces (or bins), then apply groupby('time.hour') to each piece.

This is needed because grouping by month and then by hour, whether one after the other or together, aggregates over all the data. If you first split the data by month, you can apply the group-by mean to each month separately.

You can try this approach as mentioned in documentation:

GroupBy: split-apply-combine

xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:

Split your data into multiple independent groups. => Split them by months using groupby_bins

Apply some function to each group. => apply group by

Combine your groups back into a single data object. => apply the aggregate function mean('time')
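The three steps above can be sketched as an explicit loop. This is only an illustration on synthetic data: the random t2m values and the single-point dataset are assumptions standing in for the real file, and plain groupby('time.month') is used for the split instead of groupby_bins.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real file: two years of hourly values at one point.
time = np.arange("2015-01-01", "2017-01-01", dtype="datetime64[h]").astype("datetime64[ns]")
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"t2m": ("time", rng.normal(280.0, 5.0, time.size))},
    coords={"time": time},
)

months = []
pieces = []
# Split: iterate over one sub-dataset per calendar month.
for month, month_ds in ds.groupby("time.month"):
    # Apply: average each hour of the day within that month.
    pieces.append(month_ds.groupby("time.hour").mean("time"))
    months.append(int(month))

# Combine: stack the per-month results along a new "month" dimension.
monthly_hourly = xr.concat(pieces, dim=pd.Index(months, name="month"))
print(monthly_hourly.sizes)
```

The result has a month dimension of length 12 and an hour dimension of length 24, so a single value can be picked with monthly_hourly.sel(month=1, hour=0).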

2. Convert it into a pandas DataFrame and use group by

Warning: Not all netCDF files are convertible to a pandas DataFrame, and metadata may be lost in the conversion.

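Convert ds into a pandas DataFrame with df = ds.to_dataframe() and group as required using pandas.Grouper. Note that to_dataframe() puts time into the index (a MultiIndex when there are several dimensions), so the Grouper has to point at that index level, for example:

df.groupby(pd.Grouper(level='time', freq='1M'))['t2m'].mean()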

Note: I saw a couple of answers using pandas.TimeGrouper, but it is deprecated and one has to use pandas.Grouper now.

Since your dataset is large, the question does not include a minimal sample, and working on the full file is resource-heavy, I would suggest looking at these pandas examples:
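For the month-by-hour average asked about in the question, one possible pandas sketch is to add explicit month and hour columns and group on them. The small grid and the synthetic t2m values below are assumptions made so the example runs without the original file. The values encode month and hour so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: two years of hourly data on a tiny 2 x 3 grid.
time = np.arange("2015-01-01", "2017-01-01", dtype="datetime64[h]").astype("datetime64[ns]")
lat = np.array([48.0, 47.75])
lon = np.array([230.0, 230.25, 230.5])
idx = pd.DatetimeIndex(time)
# Each value is 100 * month + hour, identical at every grid point.
signal = 100.0 * idx.month.to_numpy() + idx.hour.to_numpy()
values = signal[:, None, None] * np.ones((1, lat.size, lon.size))
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"), values)},
    coords={"time": time, "latitude": lat, "longitude": lon},
)

# to_dataframe() yields a MultiIndex; reset_index() turns it into columns.
df = ds.to_dataframe().reset_index()
df["month"] = df["time"].dt.month
df["hour"] = df["time"].dt.hour

# One mean per (month, hour) and grid point.
clim = df.groupby(["month", "hour", "latitude", "longitude"])["t2m"].mean()
print(clim.loc[(3, 5, 48.0, 230.0)])
```

Dropping 'latitude' and 'longitude' from the group keys would also average over space, which may or may not be what you want.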

group by weekdays
group by time
groupby-date-range-depending-on-each-row
group-and-count-rows-by-month-and-year

Answer 2:

In case you didn't solve the problem yet, you can do it this way:

# define a function with the hourly calculation:
def hour_mean(x):
    return x.groupby('time.hour').mean('time')

# group by month, then apply the function:
ds.groupby('time.month').apply(hour_mean)

This is the same strategy as the one in the first option given by @Prateek and based on the documentation, but the documentation was not that clear for me, so I hope this helps clarify. You can't apply a groupby operation to a groupby object so you have to build it into a function and use .apply() for it to work.
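Here is a self-contained check of this approach on synthetic data, in case it helps. The grid and the t2m values are made up for illustration (each value is 100 * month + hour, so the expected means are obvious), and .map() is used because it is the newer name for .apply() on groupby objects in recent xarray releases.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the real file: two years of hourly data, 2 x 3 grid.
time = np.arange("2015-01-01", "2017-01-01", dtype="datetime64[h]").astype("datetime64[ns]")
idx = pd.DatetimeIndex(time)
signal = 100.0 * idx.month.to_numpy() + idx.hour.to_numpy()
ds = xr.Dataset(
    {"t2m": (("time", "latitude", "longitude"),
             signal[:, None, None] * np.ones((1, 2, 3)))},
    coords={"time": time,
            "latitude": [48.0, 47.75],
            "longitude": [230.0, 230.25, 230.5]},
)


def hour_mean(x):
    # Hourly mean within whatever subset of times is passed in.
    return x.groupby("time.hour").mean("time")


# Group by month, then compute the hourly mean inside each month.
result = ds.groupby("time.month").map(hour_mean)
print(result["t2m"].sizes)

# July, 14:00 at one grid point.
value = float(result["t2m"].sel(month=7, hour=14, latitude=48.0, longitude=230.0))
print(value)
```

The output keeps latitude and longitude and gains month (12) and hour (24) dimensions in place of time.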

Answer 3:

Another way to do multitemporal grouping on a netCDF file with xarray is to combine the DataArray "resample" method with the "groupby" method. The same approach also works for Dataset objects.

With this approach, one can retrieve values like the monthly-hourly mean, or other kinds of temporal aggregation (e.g. annual monthly mean, bi-annual three-monthly sum, etc.).

The example below uses the xarray tutorial dataset 'rasm', which holds monthly mean air temperature (Tair). Notice that I had to convert the time coordinate of the tutorial data into a pandas DatetimeIndex. Without this conversion, the resampling function fails with the error message shown below:

Error message:

"TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'Index'"

Setting that time-index problem aside (it could be a separate question on Stack Overflow), the code below presents two possible solutions for the multitemporal grouping problem in xarray objects. The first uses the xarray.core.groupby.DataArrayGroupBy class directly, while the second only uses the groupby method of ordinary DataArray and Dataset objects.

Sincerely yours,

Philipe Riskalla Leal

Code snippet:

import pandas as pd
import xarray as xr

ds = xr.tutorial.open_dataset('rasm').load()

def parse_datetime(time):
    return pd.to_datetime([str(x) for x in time])

ds.coords['time'] = parse_datetime(ds.coords['time'].values)


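# Option 1 for multitemporal aggregation.
# Note: xr.core.groupby is internal xarray API, so this constructor call
# may not be accepted by recent releases.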


time_grouper = pd.Grouper(freq='Y')

grouped = xr.core.groupby.DataArrayGroupBy(ds, 'time', grouper=time_grouper)

for idx, sub_da in grouped:
    print(sub_da.resample({'time':'3M'}).mean().coords)


# Option 2 for multitemporal aggregation:


grouped = ds.groupby('time.year')
for idx, sub_da in grouped:
    print(sub_da.resample({'time':'3M'}).mean().coords)

Answer 4:

Not a python solution, but I think this is how you could do it using CDO in a bash script loop:

# loop over months:
for i in {1..12}; do
   # This gives the hourly mean for each month separately 
   cdo yhourmean -selmon,${i} datafile.nc mon${i}.nc
done
# merge the files
cdo mergetime mon*.nc hourlyfile.nc
rm -f mon*.nc # clean up the files

Note that if your data doesn't start in January, you will get a "jump" in the time axis of the final file. I think that can be sorted out by setting the year after the yhourmean command, if that is an issue for you.

Answer 5:

With this

import xarray as xr
ds = xr.open_mfdataset('ecmwf_usa_2015.nc')
print(ds.groupby('time.hour').mean('time'))

I get something like this:

Dimensions:    (hour: 24, latitude: 93, longitude: 281)
Coordinates:
  * longitude  (longitude) float32 230.0 230.25 230.5 230.75 231.0 231.25 ...
  * latitude   (latitude) float32 48.0 47.75 47.5 47.25 47.0 46.75 46.5 ...
  * hour       (hour) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...

I think that is what you want.

Source: https://stackoverflow.com/questions/49620140/get-hourly-average-for-each-month-from-a-netcdf-file

Tags

python

netcdf

xarray