Writing xarray multiindex data in chunks

夕颜 2021-02-20 01:11

I am trying to efficiently restructure a large multidimensional dataset. Let's assume I have a number of remotely sensed images over time with a number of bands with coordinates x

2 answers
  • 2021-02-20 01:30

    I have a solution here (https://github.com/pydata/xarray/issues/1077#issuecomment-644803374) for writing multiindexed datasets to file.

    You'll have to manually "encode" the dataset into a form that can be written as netCDF. And then "decode" when you read it back.

    import numpy as np
    import pandas as pd
    import xarray as xr
    
    
    def encode_multiindex(ds, idxname):
        encoded = ds.reset_index(idxname)
        coords = dict(zip(ds.indexes[idxname].names, ds.indexes[idxname].levels))
        for coord in coords:
            encoded[coord] = coords[coord].values
        shape = [encoded.sizes[coord] for coord in coords]
        encoded[idxname] = np.ravel_multi_index(ds.indexes[idxname].codes, shape)
        encoded[idxname].attrs["compress"] = " ".join(ds.indexes[idxname].names)
        return encoded
    
    
    def decode_to_multiindex(encoded, idxname):
        names = encoded[idxname].attrs["compress"].split(" ")
        shape = [encoded.sizes[dim] for dim in names]
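        # Use the index name passed in, not a hard-coded variable name
        indices = np.unravel_index(encoded[idxname].values, shape)
        arrays = [encoded[dim].values[index] for dim, index in zip(names, indices)]
        # Restore the level names so the decoded index matches the original
        mindex = pd.MultiIndex.from_arrays(arrays, names=names)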
    
        decoded = xr.Dataset({}, {idxname: mindex})
        for varname in encoded.data_vars:
            if idxname in encoded[varname].dims:
                decoded[varname] = (idxname, encoded[varname].values)
        return decoded
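
    The encode/decode idea can be seen on its own, without xarray or netCDF. The following is a minimal sketch, not the code from the linked issue: it flattens the integer codes of a pandas MultiIndex into one integer per entry and then rebuilds the index from those integers. The toy index and its level names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy index; the level names and values are made up.
mindex = pd.MultiIndex.from_arrays(
    [[10, 10, 20, 30], ["a", "b", "a", "b"]], names=["y", "x"]
)

# Encode: one flat integer per entry, computed from the per-level codes.
shape = [len(level) for level in mindex.levels]
codes = tuple(np.asarray(c, dtype=np.intp) for c in mindex.codes)
flat = np.ravel_multi_index(codes, shape)

# Decode: recover the per-level codes, then look the labels up in the levels.
decoded_codes = np.unravel_index(flat, shape)
arrays = [
    level.values[code] for level, code in zip(mindex.levels, decoded_codes)
]
restored = pd.MultiIndex.from_arrays(arrays, names=list(mindex.names))

print(restored.equals(mindex))
```

    Only the flat integers and the level values need to be written to file; the shape is recovered from the lengths of the levels.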
    
  • 2021-02-20 01:38

    This is not a solution yet, but a version of your code (below), modified so that others can easily reproduce the problem and try to solve it.

    The problem is with the stack operation, concatenated.stack(sample=('y','x','time')). At this step, memory use keeps increasing until the process is killed.

    The concatenated object is a Dask-backed xarray.DataArray, so we could expect the stack operation to be done lazily by Dask. So why is the process killed at this step?

    There are two possibilities for what is happening here:

    • The stack operation is in fact done lazily by Dask, but because the data are so large, even the minimum memory Dask requires is too much

    • The stack operation is NOT Dask-backed

    
    import numpy as np
    import dask.array as da
    import xarray as xr
    from numpy.random import RandomState
    
    nrows = 20000
    ncols = 20000
    row_chunks = 500
    col_chunks = 500
    
    
    # Create a reproducible random numpy array
    prng = RandomState(1234567890)
    numpy_array = prng.rand(1, nrows, ncols)
    
    data = da.from_array(numpy_array, chunks=(1, row_chunks, col_chunks))
    
    
    def create_band(data, x, y, band_name):
    
        return xr.DataArray(data,
                            dims=('band', 'y', 'x'),
                            coords={'band': [band_name],
                                    'y': y,
                                    'x': x})
    
    def create_coords(data, left, top, celly, cellx):
        nrows = data.shape[-2]
        ncols = data.shape[-1]
        right = left + cellx*ncols
        bottom = top - celly*nrows
        x = np.linspace(left, right, ncols) + cellx/2.0
        y = np.linspace(top, bottom, nrows) - celly/2.0
        
        return x, y
    
    
    x, y = create_coords(data, 1000, 2000, 30, 30)
    
    bands = ['blue', 'green', 'red', 'nir']
    times = ['t1', 't2', 't3']
    bands_list = [create_band(data, x, y, band) for band in bands]
    
    src = []
    
    for time in times:
    
        src_t = xr.concat(bands_list, dim='band')\
                        .expand_dims(dim='time')\
                        .assign_coords({'time': [time]})
    
        src.append(src_t)
    
    
    concatenated = xr.concat(src, dim='time')
    print(concatenated)
    # computed = concatenated.compute() # "computed" is ~35.8GB
    
    stacked = concatenated.stack(sample=('y','x','time'))
    
    transposed = stacked.T
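
    One way to probe the two possibilities is to run the same stack on a much smaller array and inspect the result. This is only a sketch, assuming a recent xarray with Dask installed; the small array below is a stand-in for concatenated, not the real data. It checks whether the values remain a Dask array after stacking and what kind of index the new sample dimension gets.

```python
import numpy as np
import dask.array as da
import pandas as pd
import xarray as xr

# Small stand-in for `concatenated`: same dimension order, far fewer pixels.
nt, nb, ny, nx = 3, 4, 40, 40
data = da.random.random((nt, nb, ny, nx), chunks=(1, 1, 10, 10))
small = xr.DataArray(
    data,
    dims=("time", "band", "y", "x"),
    coords={
        "time": ["t1", "t2", "t3"],
        "band": ["blue", "green", "red", "nir"],
        "y": np.arange(ny),
        "x": np.arange(nx),
    },
)

stacked = small.stack(sample=("y", "x", "time"))

# Are the data values still a lazy Dask array after stacking?
is_lazy = isinstance(stacked.data, da.Array)

# The new "sample" index has one entry per (y, x, time) combination.
index = stacked.indexes["sample"]
is_multiindex = isinstance(index, pd.MultiIndex)

print(is_lazy, is_multiindex, len(index))
```

    If the values stay lazy while the index is an in-memory pandas MultiIndex, the index becomes a candidate for the memory growth at full size, but this small test does not establish that on its own.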
    
    

    One can change the values of nrows and ncols to vary the size of concatenated. For performance, the chunk sizes could (and probably should) be varied too.
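
    As a rough guide to the sizes involved when changing nrows and ncols, the arithmetic can be written out directly. The data size follows from the script above; the index figure is an assumption for illustration only (8 bytes per stacked sample), not a measurement of what pandas or xarray actually allocate.

```python
# Sizes taken from the reproduction script above.
nrows, ncols = 20000, 20000
n_bands, n_times = 4, 3
bytes_per_value = 8  # float64, as produced by RandomState.rand

# Total number of values in `concatenated` and their size in GiB.
n_values = n_times * n_bands * nrows * ncols
data_gib = n_values * bytes_per_value / 2**30

# Stacking ('y', 'x', 'time') gives one sample per pixel and time step.
n_samples = nrows * ncols * n_times

# Assumption for illustration only: 8 bytes per sample for a flat index.
index_gib = n_samples * 8 / 2**30

print(f"data: {data_gib:.1f} GiB")
print(f"samples: {n_samples:,}")
print(f"index at 8 bytes per sample: {index_gib:.1f} GiB")
```

    Halving both nrows and ncols divides each of these figures by four, which gives a quick way to pick a size that fits in memory while testing.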

    Note: I also tried this:

    concatenated.to_netcdf("concatenated.nc")
    concatenated = xr.open_dataarray("concatenated.nc", chunks=10)
    

    This ensures that it is a Dask-backed DataArray and makes it possible to adjust the chunks too. I tried different values for chunks, but always ran out of memory.
