Question
I tried several methods to make an xarray (xr) dataset out of multiple .h5 files. The files contain data from the SMAP project on soil moisture content, along with other useful variables. Each variable represents a 2-D array. Every file contains the same variables with the same labels. The problem is that the sizes of dimensions x and y are not equal across files.
Example dataset loaded via xr.open_dataset():
<xarray.Dataset>
Dimensions: (x: 54, y: 129)
Coordinates:
EASE_column_index_3km (x, y) float32 ...
EASE_column_index_apm_3km (x, y) float32 ...
EASE_row_index_3km (x, y) float32 ...
EASE_row_index_apm_3km (x, y) float32 ...
latitude_3km (x, y) float32 ...
latitude_apm_3km (x, y) float32 ...
longitude_3km (x, y) float32 ...
longitude_apm_3km (x, y) float32 ...
Dimensions without coordinates: x, y
Data variables:
SMAP_Sentinel_overpass_timediff_hr_3km (x, y) timedelta64[ns] ...
SMAP_Sentinel_overpass_timediff_hr_apm_3km (x, y) timedelta64[ns] ...
albedo_3km (x, y) float32 ...
albedo_apm_3km (x, y) float32 ...
bare_soil_roughness_retrieved_3km (x, y) float32 ...
bare_soil_roughness_retrieved_apm_3km (x, y) float32 ...
beta_tbv_vv_3km (x, y) float32 ...
beta_tbv_vv_apm_3km (x, y) float32 ...
disagg_soil_moisture_3km (x, y) float32 ...
disagg_soil_moisture_apm_3km (x, y) float32 ...
disaggregated_tb_v_qual_flag_3km (x, y) float32 ...
disaggregated_tb_v_qual_flag_apm_3km (x, y) float32 ...
gamma_vv_xpol_3km (x, y) float32 ...
gamma_vv_xpol_apm_3km (x, y) float32 ...
landcover_class_3km (x, y) float32 ...
landcover_class_apm_3km (x, y) float32 ...
retrieval_qual_flag_3km (x, y) float32 ...
retrieval_qual_flag_apm_3km (x, y) float32 ...
sigma0_incidence_angle_3km (x, y) float32 ...
sigma0_incidence_angle_apm_3km (x, y) float32 ...
sigma0_vh_aggregated_3km (x, y) float32 ...
sigma0_vh_aggregated_apm_3km (x, y) float32 ...
sigma0_vv_aggregated_3km (x, y) float32 ...
sigma0_vv_aggregated_apm_3km (x, y) float32 ...
soil_moisture_3km (x, y) float32 ...
soil_moisture_apm_3km (x, y) float32 ...
soil_moisture_std_dev_3km (x, y) float32 ...
soil_moisture_std_dev_apm_3km (x, y) float32 ...
spacecraft_overpass_time_seconds_3km (x, y) timedelta64[ns] ...
spacecraft_overpass_time_seconds_apm_3km (x, y) timedelta64[ns] ...
surface_flag_3km (x, y) float32 ...
surface_flag_apm_3km (x, y) float32 ...
surface_temperature_3km (x, y) float32 ...
surface_temperature_apm_3km (x, y) float32 ...
tb_v_disaggregated_3km (x, y) float32 ...
tb_v_disaggregated_apm_3km (x, y) float32 ...
tb_v_disaggregated_std_3km (x, y) float32 ...
tb_v_disaggregated_std_apm_3km (x, y) float32 ...
vegetation_opacity_3km (x, y) float32 ...
vegetation_opacity_apm_3km (x, y) float32 ...
vegetation_water_content_3km (x, y) float32 ...
vegetation_water_content_apm_3km (x, y) float32 ...
water_body_fraction_3km (x, y) float32 ...
water_body_fraction_apm_3km (x, y) float32 ...
Example variable dataset.soil_moisture_3km:
<xarray.DataArray 'soil_moisture_3km' (x: 54, y: 129)>
array([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
Coordinates:
EASE_column_index_3km (x, y) float32 ...
EASE_column_index_apm_3km (x, y) float32 ...
EASE_row_index_3km (x, y) float32 ...
EASE_row_index_apm_3km (x, y) float32 ...
latitude_3km (x, y) float32 ...
latitude_apm_3km (x, y) float32 ...
longitude_3km (x, y) float32 ...
longitude_apm_3km (x, y) float32 ...
Dimensions without coordinates: x, y
Attributes:
units: cm**3/cm**3
valid_min: 0.0
long_name: Representative soil moisture measurement for the 3 km Earth...
coordinates: /Soil_Moisture_Retrieval_Data_3km/latitude_3km /Soil_Moistu...
valid_max: 0.75
First I tried to open the files with:
test = xr.open_mfdataset(list_of_paths)
This error occurs:
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {129, 132}
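The failure is easy to reproduce with toy data (synthetic arrays standing in for the SMAP variables, sizes shrunk from 129/132 to 3/5): any two datasets whose unlabeled dimensions differ in size refuse to align, which is roughly what open_mfdataset attempts internally.

```python
import numpy as np
import xarray as xr

# Two toy datasets whose 'x' dimension differs in size (3 vs 5) and
# carries no coordinate labels, mirroring the {129, 132} mismatch above.
ds_a = xr.Dataset({"soil_moisture_3km": (("x", "y"), np.zeros((3, 4), dtype="float32"))})
ds_b = xr.Dataset({"soil_moisture_3km": (("x", "y"), np.zeros((5, 4), dtype="float32"))})

try:
    xr.align(ds_a, ds_b)
except ValueError as err:
    # "...without labels along dimension 'x' cannot be aligned..."
    print(err)
```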
Then I tried combining by coordinates:
test = xr.open_mfdataset(list_of_paths, combine='by_coords')
which produces this error:
ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation
Then I tried this:
test = xr.open_mfdataset(list_of_paths, coords=['latitude_3km', 'longitude_3km'], combine='by_coords')
and ended up with the same error.
Then I tried opening every file with xr.open_dataset() and tried every method I could find on the documentation page for combining data, like merge, combine, broadcast_like, align & combine... but every time I ended up with the same problem: the dimensions are not equal. What is the common approach to reshape or align the dimensions, or whatever else might solve this problem?
UPDATE:
I found a workaround for my problem, but first I should mention something I forgot: the files I am trying to concatenate along the time dimension have different coordinates and dimension sizes. The images I build my model from all have overlapping areas with the same longitude and latitude values, but also parts with no overlap.
Answer 1:
"Every file contains the same variables with the same labels. The problem is that the sizes of dimensions x and y are not equal."

Sorry, is len(x) the same in every file? And is len(y) the same? Otherwise this can't be handled immediately by open_mfdataset.
If they are the same, then you have a 2D concatenation problem: you need to arrange the datasets so that, when joined up along x and y, they form a larger dataset which also has dimensions x and y. In theory you should be able to do this in two different ways.
1) Using combine='nested'
You can manually specify the order that you need them joined up in. xarray allows you to do this by passing the datasets as a grid, specified as a nested list. In your case, if we had 4 files (named [upper_left, upper_right, lower_left, lower_right]), we would combine them like so:
from xarray import open_mfdataset
grid = [[upper_left, upper_right],
        [lower_left, lower_right]]
ds = open_mfdataset(grid, concat_dim=['x', 'y'], combine='nested')
We had to tell open_mfdataset which dimensions of the data the rows and columns of the grid correspond to, so it knows which dimensions to concatenate the data along. That's why we needed to pass concat_dim=['x', 'y'].
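As a runnable sketch of the same idea (tiny synthetic tiles in place of the opened files, and a hypothetical variable name v), here is combine_nested, the function open_mfdataset delegates to for combine='nested':

```python
import numpy as np
import xarray as xr

# Four tiny 2x2 tiles standing in for the four files.
def tile(fill):
    return xr.Dataset({"v": (("x", "y"), np.full((2, 2), fill, dtype="float32"))})

grid = [[tile(0.0), tile(1.0)],   # upper_left, upper_right
        [tile(2.0), tile(3.0)]]   # lower_left, lower_right

# Outer list level -> concatenated along 'x', inner lists -> along 'y'.
ds = xr.combine_nested(grid, concat_dim=["x", "y"])
print(ds.sizes)  # both dimensions double: x=4, y=4
```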
2) Using combine='by_coords'
But your data has coordinates in it already - can't xarray just use those to arrange the datasets in the right order? That is what the combine='by_coords' option is for, but unfortunately it requires 1-dimensional coordinates (also known as dimension coordinates) to arrange the data. Your files don't have any of those (that's why the printout says Dimensions without coordinates: x, y).
If you can add 1-dimensional coordinates to your files first, then you can use combine='by_coords' and just pass a list of all the files in any order. Otherwise you'll have to use combine='nested' in this case.
(You don't need the coords argument here; that's to do with how different coordinates are joined up, not the arrangement of datasets to use.)
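A minimal sketch of that suggestion, assuming you can derive 1-D lon/lat vectors per file (the coordinate values below are made up): once each tile carries 1-D dimension coordinates, combine_by_coords orders the tiles on its own, regardless of list order.

```python
import numpy as np
import xarray as xr

# Two tiles with 1-D dimension coordinates 'x' and 'y' (synthetic values).
left = xr.Dataset(
    {"soil_moisture_3km": (("x", "y"), np.zeros((2, 3), dtype="float32"))},
    coords={"x": [0.0, 1.0], "y": [10.0, 11.0, 12.0]},
)
right = xr.Dataset(
    {"soil_moisture_3km": (("x", "y"), np.ones((2, 3), dtype="float32"))},
    coords={"x": [2.0, 3.0], "y": [10.0, 11.0, 12.0]},
)

# List order does not matter: the coordinate values determine the layout.
combined = xr.combine_by_coords([right, left])
print(combined.sizes)  # x grows to 4, y stays 3
```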
Answer 2:
My workaround is to create a grid from the unique lon/lat values of all the .h5 files:
import numpy as np
import xarray as xr

EASE_lat = list()
EASE_lon = list()
for file in files:
    ds = xr.open_dataset(file)
    lat = ds.latitude_3km.to_series().to_list()
    lon = ds.longitude_3km.to_series().to_list()
    EASE_lat.extend(lat)
    EASE_lon.extend(lon)

unique_lon = list(set(EASE_lon))
unique_lat = list(set(EASE_lat))
unique_lon_dim = np.arange(0, len(unique_lon), 1).astype('float32')
unique_lat_dim = np.arange(0, len(unique_lat), 1).astype('float32')
longitude_3km_coord = np.sort(np.array(unique_lon).astype('float32'))
latitude_3km_coord = np.sort(np.array(unique_lat).astype('float32'))
# meshgrid yields 2-D arrays of shape (len(lon), len(lat));
# np.place then overwrites every element with NaN (no value equals exactly 1)
var_1, var_2 = np.meshgrid(latitude_3km_coord, longitude_3km_coord)
np.place(var_1, var_1 != 1, np.nan)
np.place(var_2, var_2 != 1, np.nan)
print('var_1', var_1.shape, 'dims (lon/lat):', unique_lon_dim.shape, unique_lat_dim.shape, 'coords (lon/lat):', longitude_3km_coord.shape, latitude_3km_coord.shape)
var_1 (237, 126) dims (lon/lat): (237,) (126,) coords (lon/lat): (237,) (126,)
Now I can create a base dataset:
init_ds_2v = xr.Dataset(
    data_vars={'soil_moisture_3km': (('longitude_3km', 'latitude_3km'), var_1),
               'radolan_3km': (('longitude_3km', 'latitude_3km'), var_2)},
    coords={'longitude_3km': longitude_3km_coord,
            'latitude_3km': latitude_3km_coord})
print(init_ds_2v)
<xarray.Dataset>
Dimensions: (latitude_3km: 126, longitude_3km: 237)
Coordinates:
* longitude_3km (longitude_3km) float32 5.057054 5.0881743 ... 12.401452
* latitude_3km (latitude_3km) float32 47.54788 47.582508 ... 52.0727
Data variables:
soil_moisture_3km (longitude_3km, latitude_3km) float32 nan nan ... nan nan
radolan_3km (longitude_3km, latitude_3km) float32 nan nan ... nan nan
Now I can merge any of these unequal datasets with the base grid:
compilation = ds.merge(init_ds_2v, compat='override')
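To illustrate what that merge does (with a toy base grid and a smaller scene; shapes and coordinate values are made up): the default outer join pads the scene to the full grid with NaN, and compat='override' keeps the scene's values where the two variables collide.

```python
import numpy as np
import xarray as xr

# NaN-filled base grid covering the full extent (toy coordinates).
base = xr.Dataset(
    {"soil_moisture_3km": (("longitude_3km", "latitude_3km"),
                           np.full((4, 3), np.nan, dtype="float32"))},
    coords={"longitude_3km": [5.0, 6.0, 7.0, 8.0],
            "latitude_3km": [47.0, 48.0, 49.0]},
)

# A smaller scene overlapping part of the base grid.
scene = xr.Dataset(
    {"soil_moisture_3km": (("longitude_3km", "latitude_3km"),
                           np.ones((2, 3), dtype="float32"))},
    coords={"longitude_3km": [6.0, 7.0],
            "latitude_3km": [47.0, 48.0, 49.0]},
)

# Result spans the union of coordinates: the scene's values where present,
# NaN everywhere else.
merged = scene.merge(base, compat="override")
print(merged.sizes)  # longitude_3km: 4, latitude_3km: 3
```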
I do this step in a preprocess function that I apply via open_mfdataset:
import datetime

import numpy as np
import xarray as xr

def preprocess_SMAP_3km(ds):
    filename = ds.encoding['source'][-74:]
    date = datetime.datetime.strptime(filename[21:29], '%Y%m%d')
    date = np.datetime64(date)
    ds['latitude_3km'] = ds['latitude_3km'][:, 0]    # -> 1-D array
    ds['longitude_3km'] = ds['longitude_3km'][0, :]  # -> 1-D array
    # Set coordinates for x (lon) and y (lat)
    ds = ds.rename_dims({'phony_dim_2': 'latitude', 'phony_dim_3': 'longitude'})
    ds = ds.swap_dims({'longitude': 'longitude_3km', 'latitude': 'latitude_3km'})
    ds = ds.set_coords(['latitude_3km', 'longitude_3km'])
    ds = ds['soil_moisture_3km'].to_dataset()
    ds['time'] = date
    # assign the result back, otherwise expand_dims has no effect
    ds = ds.expand_dims('time').set_coords('time')
    compilation = ds.merge(init_ds_2v, compat='override')
    return compilation

data = xr.open_mfdataset(files, preprocess=preprocess_SMAP_3km, combine='nested', concat_dim='time')
I end up with this dataset:
<xarray.Dataset>
Dimensions: (latitude_3km: 126, longitude_3km: 237, time: 1012)
Coordinates:
* latitude_3km (latitude_3km) float64 47.55 47.58 47.62 ... 52.03 52.07
* longitude_3km (longitude_3km) float64 5.057 5.088 5.119 ... 12.37 12.4
* time (time) datetime64[ns] 2015-04-01 ... 2019-11-30
Data variables:
soil_moisture_3km (time, latitude_3km, longitude_3km) float32 dask.array<chunksize=(1, 126, 237), meta=np.ndarray>
radolan_3km (time, longitude_3km, latitude_3km) float32 nan ... nan
Source: https://stackoverflow.com/questions/59288473/how-do-i-combine-multiple-datasets-h5-files-with-different-dimensions-sizes-i