How to apply a xarray u_function over NetCDF and return a 2D-array (multiple new variables) to the DataSet

Question

I am trying to use the xarray apply_ufunc to apply a given function f over all pairs of coordinates (i.e. pixels) in the Dataset.

The function f returns a 2D array (NxN matrix) as result. Therefore, the resultant Dataset would have several new variables after the analysis: a total of M new variables.

The function f itself works fine, so the error does not seem to come from it.

A possible problem may be the structure of the 2D array returned by f. As far as I understand, xarray.apply_ufunc requires the resulting array to be structured as a tuple. I even tried converting the 2D array into a tuple of arrays, but nothing has worked so far.

The same situation shows up in other people's work as well. In the linked example, the author has to run the same linear regression fitting function twice over the original Dataset in order to retrieve all parameters of the regression (beta_0 and alpha).

Therefore, I would like to know whether xarray.apply_ufunc can apply reduction functions that return multiple new variables, as in the link above (or in the code snippet below).

Below is a reproducible example of the problem. Notice that the function f returns a 2D array whose second dimension has a depth of 4. Therefore, I expect the resulting Dataset to have 4 new variables after the whole processing.

import numpy as np
import xarray as xr


x_size = 10
y_size = 10
time_size = 30

lon = np.arange(50, 50+x_size)
lat = np.arange(10, 10+y_size)
time = np.arange(10, 10+time_size)

array = np.random.randn(x_size, y_size, time_size)  # axis order matches dims ('lon', 'lat', 'time')

ds = xr.DataArray(
    data=array, 
    coords = {'lon':lon, 'lat':lat, 'time':time}, 
    dims=('lon', 'lat', 'time')
)

def f(x):
    return (x, x**2, x**3, x**4)

def f_xarray(ds, dim=['time'], dask='allowed', new_dim_name=['predicted']):   
    filtered = xr.apply_ufunc(
        f,
        ds,
        dask=dask,
        vectorize=True,
        input_core_dims=[dim],
        #exclude_dims = dim, # This must not be set.
        output_core_dims= [['x', 'x2', 'x3', 'x4']], #[new_dim_name],
        #kwargs=kwargs,
        #output_dtypes=[float],
        #dataset_join='outer',
        #dataset_fill_value=np.nan,
    ).compute()
    return filtered


ds2 = f_xarray(ds)

# Error message returned: 
# ValueError: wrong number of outputs from pyfunc: expected 1, got 4

Answer 1:

xarray.apply_ufunc is hard to get familiar with: it allows a really wide range of possibilities, and it is not always clear how to make the most of it. In this case, the error comes from input_core_dims and output_core_dims. Specifically, output_core_dims=[['x', 'x2', 'x3', 'x4']] declares a single output with four extra core dimensions, whereas f returns a tuple of four arrays, hence "expected 1, got 4". I'll first expand on their docs, emphasizing what I believe caused the confusion, and then provide a couple of solutions. Their docs are:

input_core_dims

List of the same length as args giving the list of core dimensions on each input argument that should not be broadcast. By default, we assume there are no core dimensions on any input arguments.

For example, input_core_dims=[[], ['time']] indicates that all dimensions on the first argument and all dimensions other than ‘time’ on the second argument should be broadcast.

Core dimensions are automatically moved to the last axes of input variables before applying func, which facilitates using NumPy style generalized ufuncs [2].

input_core_dims takes care of two important and related aspects of the computation. First, it determines which dimensions are broadcast (all those not listed as core dimensions). This is particularly important because the shape of the output is assumed to be the same as the shape defined by these broadcast dimensions (when this is not the case, output_core_dims must be used). Second, the input_core_dims are moved to the end. Two examples follow:

We can apply a function that does not modify the shape without any extra argument to apply_ufunc:

xr.apply_ufunc(lambda x: x**2, ds)
# Output
<xarray.DataArray (lon: 10, lat: 10, time: 30)>
array([[[6.20066642e+00, 1.68502086e+00, 9.77868899e-01, ...,
         ...,
         2.28979668e+00, 1.76491683e+00, 2.17085164e+00]]])
Coordinates:
  * lon      (lon) int64 50 51 52 53 54 55 56 57 58 59
  * lat      (lat) int64 10 11 12 13 14 15 16 17 18 19
  * time     (time) int64 10 11 12 13 14 15 16 17 18 ... 32 33 34 35 36 37 38 39

To calculate the mean along the lon dimension, for instance, we reduce one of the dimensions, so the output will have one dimension fewer than the input. We must therefore pass lon as an input_core_dim:

xr.apply_ufunc(lambda x: x.mean(axis=-1), ds, input_core_dims=[["lon"]])
# Output
<xarray.DataArray (lat: 10, time: 30)>
array([[ 7.72163214e-01,  3.98689228e-01,  9.36398702e-03,
         ...,
        -3.70034281e-01, -4.57979868e-01,  1.29770762e-01]])
Coordinates:
  * lat      (lat) int64 10 11 12 13 14 15 16 17 18 19
  * time     (time) int64 10 11 12 13 14 15 16 17 18 ... 32 33 34 35 36 37 38 39

Note that we take the mean on axis=-1 even though lon is the first dimension, because it is moved to the end as one of the input_core_dims. By the same logic, we could calculate the mean along the lat dim by using input_core_dims=[["lat"]].
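The reordering of core dimensions can be checked directly. The following is a minimal sketch, assuming a small array with a different size per dimension so the axis order is easy to read; the names da, seen_shapes and mean_last_axis are illustrative and not part of the original post.

```python
import numpy as np
import xarray as xr

# Distinct sizes per dimension make the axis order easy to see.
da = xr.DataArray(np.random.randn(2, 3, 5), dims=("lon", "lat", "time"))

seen_shapes = []  # illustrative helper: records what the function receives


def mean_last_axis(x):
    # x is a plain numpy array; the core dimension has been moved to the end.
    seen_shapes.append(x.shape)
    return x.mean(axis=-1)


result = xr.apply_ufunc(mean_last_axis, da, input_core_dims=[["lon"]])

print(seen_shapes[-1])  # lon (size 2) is now the last axis
print(result.dims)      # the remaining, broadcast dimensions
```

The function receives the data with lon as its last axis, which is why axis=-1 reduces over lon regardless of where lon sat in the original array.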

Note also the format of input_core_dims: it must be a list of lists ("List of the same length as args giving the list of core dimensions"). A tuple of tuples (or any sequence) is also valid; however, with tuples the one-element case is (("lon",),), not (("lon")).
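The tuple pitfall is plain Python behaviour and can be seen without xarray. A small sketch, with illustrative variable names:

```python
# Parentheses alone only group an expression; the trailing comma makes a tuple.
single_string = (("lon"))   # this is just the string "lon"
nested_tuple = (("lon",),)  # one-element tuple holding a one-element tuple
nested_list = [["lon"]]     # the list-of-lists form used in this answer

print(type(single_string).__name__)  # str
print(type(nested_tuple[0]).__name__)  # tuple
```

Only the last two forms describe "one argument with one core dimension"; the first one collapses to a bare string.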

output_core_dims

List of the same length as the number of output arguments from func, giving the list of core dimensions on each output that were not broadcast on the inputs. By default, we assume that func outputs exactly one array, with axes corresponding to each broadcast dimension.

Core dimensions are assumed to appear as the last dimensions of each output in the provided order.

Here again, output_core_dims is a list of lists. It must be used when there are multiple outputs (that is, func returns a tuple) or when the output has extra dimensions in addition to the broadcasted dimensions. Obviously, if there are multiple outputs with extra dims, it must be used too. We'll use the two possible solutions as examples.
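Before turning to those, here is a minimal sketch of the reduction case raised in the question: a per-pixel linear fit along time that returns two parameters in a single pass. It assumes synthetic data with a known slope and uses np.polyfit as a stand-in for the regression. The names linear_fit, slope and intercept are illustrative.

```python
import numpy as np
import xarray as xr

# Synthetic data: every pixel follows y = 2 * t + offset (assumed for illustration).
t = np.arange(30.0)
offsets = np.arange(12.0).reshape(3, 4)
da = xr.DataArray(
    2.0 * t + offsets[..., np.newaxis],
    dims=("lon", "lat", "time"),
    coords={"time": t},
)


def linear_fit(time_values, series):
    # Receives two 1D arrays (one pixel at a time) and returns two scalars.
    slope, intercept = np.polyfit(time_values, series, 1)
    return slope, intercept


slope, intercept = xr.apply_ufunc(
    linear_fit,
    da["time"],
    da,
    input_core_dims=[["time"], ["time"]],  # time is consumed by the fit
    output_core_dims=[[], []],             # two outputs, no extra dimensions
    vectorize=True,                        # loop over the (lon, lat) pixels
)

fit = xr.Dataset({"slope": slope, "intercept": intercept})
```

Each entry of output_core_dims corresponds to one element of the returned tuple, so both parameters come back together and can be collected into one Dataset.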

Solution 1

Use the function posted in the question. This function returns a tuple, therefore we need to use output_core_dims even though the shape of the arrays is not modified. As there are actually no extra dims, we'll pass an empty list per output:

xr.apply_ufunc(
    f,
    ds,
    output_core_dims= [[] for _ in range(4)], 
)

This will return a tuple of DataArrays; its output is exactly the same as f(ds).
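To end up with the four new variables the question asks for, the tuple can be collected into a Dataset. A minimal sketch, assuming a small stand-in array and hypothetical variable names ("x", "x2", "x3", "x4"):

```python
import numpy as np
import xarray as xr


def f(x):
    return (x, x**2, x**3, x**4)


# Small stand-in for the array from the question.
da = xr.DataArray(
    np.random.randn(2, 3, 5),
    dims=("lon", "lat", "time"),
    coords={"lon": [50, 51], "lat": [10, 11, 12], "time": np.arange(5)},
)

outputs = xr.apply_ufunc(f, da, output_core_dims=[[] for _ in range(4)])

# Hypothetical variable names, one per element of the returned tuple.
names = ["x", "x2", "x3", "x4"]
result = xr.Dataset(dict(zip(names, outputs)))
```

Every variable in result keeps the original lon, lat and time coordinates.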

Solution 2

We'll now modify the function to output a single array, stacking all 4 outputs in the tuple. Note that we have to make sure that this new dimension is added at the end of the array:

def f2(x):
    return np.stack((x, x**2, x**3, x**4), axis=-1)

xr.apply_ufunc(
    f2,
    ds,
    output_core_dims= [["predictions"]], 
)
# Output
<xarray.DataArray (lon: 10, lat: 10, time: 30, predictions: 4)>
array([[[[ 2.49011374e+00,  6.20066642e+00,  1.54403646e+01,
           ...,
           4.71259686e+00]]]])
Coordinates:
  * lon      (lon) int64 50 51 52 53 54 55 56 57 58 59
  * lat      (lat) int64 10 11 12 13 14 15 16 17 18 19
  * time     (time) int64 10 11 12 13 14 15 16 17 18 ... 32 33 34 35 36 37 38 39
Dimensions without coordinates: predictions

We have now passed predictions as an output core dim, which gives the output a new predictions dimension in addition to the original 3. Here the output is no longer equivalent to f2(ds) (which returns a numpy array): thanks to apply_ufunc, we have been able to apply several functions and stack the results without losing the labels.
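If separate variables are preferred over a stacked dimension, the new dimension can be labelled and then split. A minimal sketch, assuming a small stand-in array and hypothetical labels for the predictions dimension:

```python
import numpy as np
import xarray as xr


def f2(x):
    return np.stack((x, x**2, x**3, x**4), axis=-1)


# Small stand-in for the array from the question.
da = xr.DataArray(np.random.randn(2, 3, 5), dims=("lon", "lat", "time"))

stacked = xr.apply_ufunc(f2, da, output_core_dims=[["predictions"]])

# Hypothetical labels; they become the variable names after the split.
labelled = stacked.assign_coords(predictions=["x", "x2", "x3", "x4"])
result = labelled.to_dataset(dim="predictions")
```

Here to_dataset(dim=...) turns each label along predictions into its own data variable, which gives the same layout as collecting a tuple of outputs into a Dataset.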

Side note: it is generally not recommended to use mutable objects as default arguments in functions; see for example "Least Astonishment" and the Mutable Default Argument.
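A short sketch of why this matters, using illustrative functions rather than f_xarray itself: the default list is created once, when the function is defined, and is then shared across calls.

```python
def collect_bad(item, bucket=[]):
    # The same list object is reused on every call that omits `bucket`.
    bucket.append(item)
    return bucket


def collect_good(item, bucket=None):
    # A fresh list is created for each call that omits `bucket`.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


first_bad = list(collect_bad("time"))
second_bad = list(collect_bad("lon"))   # still contains "time"

first_good = collect_good("time")
second_good = collect_good("lon")       # independent of the first call
```

Applied to the question's code, this would mean defaulting dim and new_dim_name to None (or to a tuple) and building the list inside the function.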

Source: https://stackoverflow.com/questions/58719696/how-to-apply-a-xarray-u-function-over-netcdf-and-return-a-2d-array-multiple-new

Tags

python

netcdf

python-xarray