Question
I tried several methods to make an xarray (xr) dataset out of multiple .h5 files. The files contain data from the SMAP project on soil moisture content, along with other useful variables. Each variable represents a 2-D array. Every file contains the same variables with the same labels. The problem is that the sizes of dimensions x and y are not equal across files.
Example dataset loaded via xr.open_dataset():
<xarray.Dataset>
Dimensions: (x: 54, y: 129)
Coordinates:
EASE_column_index_3km (x, y) float32 ...
EASE_column_index_apm_3km (x, y) float32 ...
EASE_row_index_3km (x, y) float32 ...
EASE_row_index_apm_3km (x, y) float32 ...
latitude_3km (x, y) float32 ...
latitude_apm_3km (x, y) float32 ...
longitude_3km (x, y) float32 ...
longitude_apm_3km (x, y) float32 ...
Dimensions without coordinates: x, y
Data variables:
SMAP_Sentinel_overpass_timediff_hr_3km (x, y) timedelta64[ns] ...
SMAP_Sentinel_overpass_timediff_hr_apm_3km (x, y) timedelta64[ns] ...
albedo_3km (x, y) float32 ...
albedo_apm_3km (x, y) float32 ...
bare_soil_roughness_retrieved_3km (x, y) float32 ...
bare_soil_roughness_retrieved_apm_3km (x, y) float32 ...
beta_tbv_vv_3km (x, y) float32 ...
beta_tbv_vv_apm_3km (x, y) float32 ...
disagg_soil_moisture_3km (x, y) float32 ...
disagg_soil_moisture_apm_3km (x, y) float32 ...
disaggregated_tb_v_qual_flag_3km (x, y) float32 ...
disaggregated_tb_v_qual_flag_apm_3km (x, y) float32 ...
gamma_vv_xpol_3km (x, y) float32 ...
gamma_vv_xpol_apm_3km (x, y) float32 ...
landcover_class_3km (x, y) float32 ...
landcover_class_apm_3km (x, y) float32 ...
retrieval_qual_flag_3km (x, y) float32 ...
retrieval_qual_flag_apm_3km (x, y) float32 ...
sigma0_incidence_angle_3km (x, y) float32 ...
sigma0_incidence_angle_apm_3km (x, y) float32 ...
sigma0_vh_aggregated_3km (x, y) float32 ...
sigma0_vh_aggregated_apm_3km (x, y) float32 ...
sigma0_vv_aggregated_3km (x, y) float32 ...
sigma0_vv_aggregated_apm_3km (x, y) float32 ...
soil_moisture_3km (x, y) float32 ...
soil_moisture_apm_3km (x, y) float32 ...
soil_moisture_std_dev_3km (x, y) float32 ...
soil_moisture_std_dev_apm_3km (x, y) float32 ...
spacecraft_overpass_time_seconds_3km (x, y) timedelta64[ns] ...
spacecraft_overpass_time_seconds_apm_3km (x, y) timedelta64[ns] ...
surface_flag_3km (x, y) float32 ...
surface_flag_apm_3km (x, y) float32 ...
surface_temperature_3km (x, y) float32 ...
surface_temperature_apm_3km (x, y) float32 ...
tb_v_disaggregated_3km (x, y) float32 ...
tb_v_disaggregated_apm_3km (x, y) float32 ...
tb_v_disaggregated_std_3km (x, y) float32 ...
tb_v_disaggregated_std_apm_3km (x, y) float32 ...
vegetation_opacity_3km (x, y) float32 ...
vegetation_opacity_apm_3km (x, y) float32 ...
vegetation_water_content_3km (x, y) float32 ...
vegetation_water_content_apm_3km (x, y) float32 ...
water_body_fraction_3km (x, y) float32 ...
water_body_fraction_apm_3km (x, y) float32 ...
Example variable dataset.soil_moisture_3km:
<xarray.DataArray 'soil_moisture_3km' (x: 54, y: 129)>
array([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], dtype=float32)
Coordinates:
EASE_column_index_3km (x, y) float32 ...
EASE_column_index_apm_3km (x, y) float32 ...
EASE_row_index_3km (x, y) float32 ...
EASE_row_index_apm_3km (x, y) float32 ...
latitude_3km (x, y) float32 ...
latitude_apm_3km (x, y) float32 ...
longitude_3km (x, y) float32 ...
longitude_apm_3km (x, y) float32 ...
Dimensions without coordinates: x, y
Attributes:
units: cm**3/cm**3
valid_min: 0.0
long_name: Representative soil moisture measurement for the 3 km Earth...
coordinates: /Soil_Moisture_Retrieval_Data_3km/latitude_3km /Soil_Moistu...
valid_max: 0.75
First I tried to open the files with:
test = xr.open_mfdataset(list_of_paths)
This error occurs:
ValueError: arguments without labels along dimension 'x' cannot be aligned because they have different dimension sizes: {129, 132}
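The failure is easy to reproduce with toy data (synthetic arrays standing in for the SMAP variables, sizes shrunk from 129/132 to 3/5): any two datasets whose unlabeled dimensions differ in size refuse to align, which is roughly what open_mfdataset attempts internally.

```python
import numpy as np
import xarray as xr

# Two toy datasets whose 'x' dimension differs in size (3 vs 5) and
# carries no coordinate labels, mirroring the {129, 132} mismatch above.
ds_a = xr.Dataset({"soil_moisture_3km": (("x", "y"), np.zeros((3, 4), dtype="float32"))})
ds_b = xr.Dataset({"soil_moisture_3km": (("x", "y"), np.zeros((5, 4), dtype="float32"))})

try:
    xr.align(ds_a, ds_b)
except ValueError as err:
    # "...without labels along dimension 'x' cannot be aligned..."
    print(err)
```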
Then I tried combining by coordinates:
test = xr.open_mfdataset(list_of_paths, combine='by_coords')
which produces this error:
ValueError: Could not find any dimension coordinates to use to order the datasets for concatenation
Then I tried this:
test = xr.open_mfdataset(list_of_paths, coords=['latitude_3km', 'longitude_3km'], combine='by_coords')
and ended up with the same error.
Then I tried opening every file with xr.open_dataset() and tried every method I could find on the documentation page for combining data, like merge, combine, broadcast_like, align & combine... but every time I ended up with the same problem: the dimensions are not equal. What is the common approach to reshape or align the dimensions, or whatever else might solve this problem?
UPDATE:
I found a workaround for my problem, but first I should mention something I forgot: the files I am trying to concatenate along the time dimension have different coordinates and dimension sizes. The images I build my model from all have overlapping areas with the same longitude and latitude values, but also parts with no overlap.
Answer 1:
"Every file contains the same variables with the same labels. The problem is that the sizes of dimensions x and y are not equal."

Sorry, is len(x) the same in every file? And is len(y) the same? Otherwise this can't be handled immediately by open_mfdataset.
If they are the same, then you have a 2D concatenation problem: you need to arrange the datasets so that, when joined up along x and y, they form a larger dataset which also has dimensions x and y. In theory you should be able to do this in two different ways.
1) Using combine='nested'
You can manually specify the order that you need them joined up in. xarray allows you to do this by passing the datasets as a grid, specified as a nested list. In your case, if we had 4 files (named [upper_left, upper_right, lower_left, lower_right]), we would combine them like so:
from xarray import open_mfdataset
grid = [[upper_left, upper_right],
        [lower_left, lower_right]]
ds = open_mfdataset(grid, concat_dim=['x', 'y'], combine='nested')
We had to tell open_mfdataset which dimensions of the data the rows and columns of the grid correspond to, so it knows which dimensions to concatenate the data along. That's why we needed to pass concat_dim=['x', 'y'].
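As a runnable sketch of the same idea (tiny synthetic tiles in place of the opened files, and a hypothetical variable name v), here is combine_nested, the function open_mfdataset delegates to for combine='nested':

```python
import numpy as np
import xarray as xr

# Four tiny 2x2 tiles standing in for the four files.
def tile(fill):
    return xr.Dataset({"v": (("x", "y"), np.full((2, 2), fill, dtype="float32"))})

grid = [[tile(0.0), tile(1.0)],   # upper_left, upper_right
        [tile(2.0), tile(3.0)]]   # lower_left, lower_right

# Outer list level -> concatenated along 'x', inner lists -> along 'y'.
ds = xr.combine_nested(grid, concat_dim=["x", "y"])
print(ds.sizes)  # both dimensions double: x=4, y=4
```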
2) Using combine='by_coords'
But your data has coordinates in it already - can't xarray just use those to arrange the datasets in the right order? That is what the combine='by_coords' option is for, but unfortunately it requires 1-dimensional coordinates (also known as dimension coordinates) to arrange the data. Your files don't have any of those (that's why the printout says Dimensions without coordinates: x, y).
If you can add 1-dimensional coordinates to your files first, then you can use combine='by_coords' and just pass a list of all the files in any order. Otherwise you'll have to use combine='nested' in this case.
(You don't need the coords argument here; that's to do with how different coordinates are joined up, not the arrangement of datasets to use.)
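A minimal sketch of that suggestion, assuming you can derive 1-D lon/lat vectors per file (the coordinate values below are made up): once each tile carries 1-D dimension coordinates, combine_by_coords orders the tiles on its own, regardless of list order.

```python
import numpy as np
import xarray as xr

# Two tiles with 1-D dimension coordinates 'x' and 'y' (synthetic values).
left = xr.Dataset(
    {"soil_moisture_3km": (("x", "y"), np.zeros((2, 3), dtype="float32"))},
    coords={"x": [0.0, 1.0], "y": [10.0, 11.0, 12.0]},
)
right = xr.Dataset(
    {"soil_moisture_3km": (("x", "y"), np.ones((2, 3), dtype="float32"))},
    coords={"x": [2.0, 3.0], "y": [10.0, 11.0, 12.0]},
)

# List order does not matter: the coordinate values determine the layout.
combined = xr.combine_by_coords([right, left])
print(combined.sizes)  # x grows to 4, y stays 3
```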
Answer 2:
My workaround is to create a grid from the unique lon/lat values of all the .h5 files:
import numpy as np
import xarray as xr

EASE_lat = list()
EASE_lon = list()
for file in files:
    ds = xr.open_dataset(file)
    lat = ds.latitude_3km.to_series().to_list()
    lon = ds.longitude_3km.to_series().to_list()
    EASE_lat.extend(lat)
    EASE_lon.extend(lon)

unique_lon = list(set(EASE_lon))
unique_lat = list(set(EASE_lat))
unique_lon_dim = np.arange(0, len(unique_lon), 1).astype('float32')
unique_lat_dim = np.arange(0, len(unique_lat), 1).astype('float32')
longitude_3km_coord = np.sort(np.array(unique_lon).astype('float32'))
latitude_3km_coord = np.sort(np.array(unique_lat).astype('float32'))
# meshgrid yields 2-D arrays of shape (len(lon), len(lat));
# np.place then overwrites every element with NaN (no value equals exactly 1)
var_1, var_2 = np.meshgrid(latitude_3km_coord, longitude_3km_coord)
np.place(var_1, var_1 != 1, np.nan)
np.place(var_2, var_2 != 1, np.nan)
print('var_1', var_1.shape, 'dims (lon/lat):', unique_lon_dim.shape, unique_lat_dim.shape, 'coords (lon/lat):', longitude_3km_coord.shape, latitude_3km_coord.shape)
var_1 (237, 126) dims (lon/lat): (237,) (126,) coords (lon/lat): (237,) (126,)
Now I can create a base dataset:
init_ds_2v = xr.Dataset(
    data_vars={'soil_moisture_3km': (('longitude_3km', 'latitude_3km'), var_1),
               'radolan_3km': (('longitude_3km', 'latitude_3km'), var_2)},
    coords={'longitude_3km': longitude_3km_coord,
            'latitude_3km': latitude_3km_coord})
print(init_ds_2v)
<xarray.Dataset>
Dimensions: (latitude_3km: 126, longitude_3km: 237)
Coordinates:
* longitude_3km (longitude_3km) float32 5.057054 5.0881743 ... 12.401452
* latitude_3km (latitude_3km) float32 47.54788 47.582508 ... 52.0727
Data variables:
soil_moisture_3km (longitude_3km, latitude_3km) float32 nan nan ... nan nan
radolan_3km (longitude_3km, latitude_3km) float32 nan nan ... nan nan
Now I can merge any of these unequal datasets with the base grid:
compilation = ds.merge(init_ds_2v, compat='override')
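To illustrate what that merge does (with a toy base grid and a smaller scene; shapes and coordinate values are made up): the default outer join pads the scene to the full grid with NaN, and compat='override' keeps the scene's values where the two variables collide.

```python
import numpy as np
import xarray as xr

# NaN-filled base grid covering the full extent (toy coordinates).
base = xr.Dataset(
    {"soil_moisture_3km": (("longitude_3km", "latitude_3km"),
                           np.full((4, 3), np.nan, dtype="float32"))},
    coords={"longitude_3km": [5.0, 6.0, 7.0, 8.0],
            "latitude_3km": [47.0, 48.0, 49.0]},
)

# A smaller scene overlapping part of the base grid.
scene = xr.Dataset(
    {"soil_moisture_3km": (("longitude_3km", "latitude_3km"),
                           np.ones((2, 3), dtype="float32"))},
    coords={"longitude_3km": [6.0, 7.0],
            "latitude_3km": [47.0, 48.0, 49.0]},
)

# Result spans the union of coordinates: the scene's values where present,
# NaN everywhere else.
merged = scene.merge(base, compat="override")
print(merged.sizes)  # longitude_3km: 4, latitude_3km: 3
```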
I do this step in a preprocess function that I apply via open_mfdataset:
import datetime

import numpy as np
import xarray as xr

def preprocess_SMAP_3km(ds):
    filename = ds.encoding['source'][-74:]
    date = datetime.datetime.strptime(filename[21:29], '%Y%m%d')
    date = np.datetime64(date)
    ds['latitude_3km'] = ds['latitude_3km'][:, 0]    # -> 1-D array
    ds['longitude_3km'] = ds['longitude_3km'][0, :]  # -> 1-D array
    # Set coordinates for x (lon) and y (lat)
    ds = ds.rename_dims({'phony_dim_2': 'latitude', 'phony_dim_3': 'longitude'})
    ds = ds.swap_dims({'longitude': 'longitude_3km', 'latitude': 'latitude_3km'})
    ds = ds.set_coords(['latitude_3km', 'longitude_3km'])
    ds = ds['soil_moisture_3km'].to_dataset()
    ds['time'] = date
    # assign the result back, otherwise expand_dims has no effect
    ds = ds.expand_dims('time').set_coords('time')
    compilation = ds.merge(init_ds_2v, compat='override')
    return compilation

data = xr.open_mfdataset(files, preprocess=preprocess_SMAP_3km, combine='nested', concat_dim='time')
I end up with this dataset:
<xarray.Dataset>
Dimensions: (latitude_3km: 126, longitude_3km: 237, time: 1012)
Coordinates:
* latitude_3km (latitude_3km) float64 47.55 47.58 47.62 ... 52.03 52.07
* longitude_3km (longitude_3km) float64 5.057 5.088 5.119 ... 12.37 12.4
* time (time) datetime64[ns] 2015-04-01 ... 2019-11-30
Data variables:
soil_moisture_3km (time, latitude_3km, longitude_3km) float32 dask.array<chunksize=(1, 126, 237), meta=np.ndarray>
radolan_3km (time, longitude_3km, latitude_3km) float32 nan ... nan
Source: https://stackoverflow.com/questions/59288473/how-do-i-combine-multiple-datasets-h5-files-with-different-dimensions-sizes-i