I am trying to efficiently restructure a large multidimensional dataset. Let's assume I have a number of remotely sensed images over time, each with a number of bands, with coordinates x, y, band and time.
I have a solution here (https://github.com/pydata/xarray/issues/1077#issuecomment-644803374) for writing multi-indexed datasets to file.
You'll have to manually "encode" the dataset into a form that can be written as netCDF, and then "decode" it when you read it back.
import numpy as np
import pandas as pd
import xarray as xr


def encode_multiindex(ds, idxname):
    # Replace the MultiIndex with a flat integer index plus one coordinate per level,
    # so the dataset can be written as netCDF.
    encoded = ds.reset_index(idxname)
    coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
    for coord in coords:
        encoded[coord] = coords[coord].values
    shape = [encoded.sizes[coord] for coord in coords]
    encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
    # The "compress" attribute records which dimensions were collapsed into the integer index.
    encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)
    return encoded


def decode_to_multiindex(encoded, idxname):
    # Rebuild the MultiIndex from the integer index and the saved level coordinates.
    names = encoded[idxname].attrs["compress"].split(" ")
    shape = [encoded.sizes[dim] for dim in names]
    indices = np.unravel_index(encoded[idxname].values, shape)
    arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
    mindex = pd.MultiIndex.from_arrays(arrays, names=names)

    decoded = xr.Dataset({}, {idxname: mindex})
    for varname in encoded.data_vars:
        if idxname in encoded[varname].dims:
            decoded[varname] = (idxname, encoded[varname].values)
    return decoded
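As a minimal usage sketch (the "sample" index name and the variable names here are only illustrative assumptions, not part of the original code), the round trip through netCDF would look roughly like this:

# "stacked" is assumed to be a DataArray with a MultiIndex dimension named "sample"
encoded = encode_multiindex(stacked.to_dataset(name="data"), "sample")
encoded.to_netcdf("stacked_encoded.nc")

# Later, read it back and rebuild the MultiIndex
encoded = xr.open_dataset("stacked_encoded.nc")
decoded = decode_to_multiindex(encoded, "sample")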
This is not a solution yet, but below is a version of your code, modified so that it is easily reproducible if others want to try to solve this problem.
The problem is with the stack operation (concatenated.stack(sample=('y','x','time'))). At this step, the memory keeps increasing and the process is killed.

The concatenated object is a Dask-backed xarray.DataArray, so we could expect the stack operation to be done lazily by Dask. So, why is the process killed at this step?

There are 2 possibilities for what is happening here:

- The stack operation is in fact done lazily by Dask, but because the data are so huge, even the minimum memory required by Dask is too much
- The stack operation is NOT Dask-backed

(A small check to tell these two cases apart is shown right after the script below.)
import numpy as np
import dask.array as da
import xarray as xr
from numpy.random import RandomState

nrows = 20000
ncols = 20000
row_chunks = 500
col_chunks = 500

# Create a reproducible random numpy array and wrap it as a chunked dask array
prng = RandomState(1234567890)
numpy_array = prng.rand(1, nrows, ncols)
data = da.from_array(numpy_array, chunks=(1, row_chunks, col_chunks))


def create_band(data, x, y, band_name):
    # Wrap the dask array in a DataArray with a single band coordinate
    return xr.DataArray(data,
                        dims=('band', 'y', 'x'),
                        coords={'band': [band_name],
                                'y': y,
                                'x': x})


def create_coords(data, left, top, celly, cellx):
    # Build x/y coordinates from a raster-style geotransform
    nrows = data.shape[-2]
    ncols = data.shape[-1]
    right = left + cellx * ncols
    bottom = top - celly * nrows
    x = np.linspace(left, right, ncols) + cellx / 2.0
    y = np.linspace(top, bottom, nrows) - celly / 2.0
    return x, y


x, y = create_coords(data, 1000, 2000, 30, 30)

bands = ['blue', 'green', 'red', 'nir']
times = ['t1', 't2', 't3']

# Concatenate the bands for each time step, then concatenate along time
bands_list = [create_band(data, x, y, band) for band in bands]

src = []
for time in times:
    src_t = xr.concat(bands_list, dim='band')\
              .expand_dims(dim='time')\
              .assign_coords({'time': [time]})
    src.append(src_t)

concatenated = xr.concat(src, dim='time')
print(concatenated)
# computed = concatenated.compute() # "computed" is ~35.8GB

stacked = concatenated.stack(sample=('y', 'x', 'time'))
transposed = stacked.T
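To see which of the two possibilities above applies, one can check whether stacked is still backed by a lazy dask array rather than a materialized numpy array. This is only a small diagnostic sketch, assuming the script above was run up to the stack call (possibly with much smaller nrows/ncols so it fits in memory):

import dask.array as da

# If stack stayed lazy, the underlying data is still a dask array
print(type(stacked.data))
print(isinstance(stacked.data, da.Array))

# Chunk layout after stacking; this is None if the data were loaded into memory
print(stacked.chunks)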
One can try to change the values of nrows and ncols in order to vary the size of concatenated, and for performance we could/should vary the chunks too.
Note: I even tried this:

concatenated.to_netcdf("concatenated.nc")
concatenated = xr.open_dataarray("concatenated.nc", chunks=10)

This was to make sure it is a Dask-backed DataArray and to be able to adjust the chunks too. I tried different values for chunks, but it always runs out of memory.
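For completeness, chunks can also be given per dimension instead of a single number, both when reopening the file and when rechunking the existing object. This is only an illustrative sketch; the chunk sizes below are arbitrary assumptions, not values known to avoid the memory problem:

# Per-dimension chunking when reopening the file (sizes are arbitrary examples)
concatenated = xr.open_dataarray(
    "concatenated.nc",
    chunks={'time': 1, 'band': 1, 'y': 2000, 'x': 2000},
)

# Or rechunk an already Dask-backed DataArray
concatenated = concatenated.chunk({'time': 1, 'band': 1, 'y': 2000, 'x': 2000})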