Question
I have a huge (~2 billion data points) xarray.DataArray. I would like to randomly delete (either mask or replace by np.nan) a given percentage of the data, where the probability of every data point being chosen for deletion/masking is the same across all coordinates. I could convert the array to a numpy.array, but I would prefer to keep it in dask chunks for speed.
My data looks like this:
>> data
<xarray.DataArray 'stack-820860ba63bd07adc355885d96354267' (variable: 8, time: 228, latitude: 721, longitude: 1440)>
dask.array<stack, shape=(8, 228, 721, 1440), dtype=float64, chunksize=(1, 6, 721, 1440)>
Coordinates:
* latitude (latitude) float32 90.0 89.75 89.5 89.25 89.0 88.75 88.5 ...
* variable (variable) <U5 u'fal' u'swvl1' u'swvl3' u'e' u'swvl2' u'es'
* longitude (longitude) float32 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0
* time (time) datetime64[ns] 2000-01-01 2000-02-01 2000-03-01 ...
I defined
frac_missing = 0.2
k = int(frac_missing*data.size)
This is what I already tried:
- This solution works with np.ndindex, but the np.ndindex object is converted to a list, which is very slow. I tried circumventing the conversion and simply iterating over the np.ndindex object as described here and here, but iterating over the whole iterator is slow for ~2 billion data points.
- np.random.choice(data.stack(newdim=('latitude','variable','longitude','time')), k, replace=False) returns the desired subset of data points, but does not set them to nan (see the numpy-only sketch after this list for the missing step).
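For reference, on a plain numpy array the exact-k version of that second attempt can be finished with flat indices and np.put; this is only a minimal sketch, with a small illustrative array standing in for the real dask-backed data:

import numpy as np

# Illustrative stand-in for the real data (the real array is dask-backed
# and far larger; shape here is arbitrary).
arr = np.random.rand(8, 228, 72, 144)

frac_missing = 0.2
k = int(frac_missing * arr.size)

# Choose exactly k flat indices, uniformly and without replacement,
# then write np.nan into the original array at those positions.
flat_idx = np.random.choice(arr.size, size=k, replace=False)
np.put(arr, flat_idx, np.nan)

assert np.isnan(arr).sum() == k  # exactly k points masked

Note that drawing 20% of ~2 billion indices without replacement this way would itself be memory-heavy, which is why the accepted answer below works per element instead.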
The expected output would be the xarray.DataArray with the given percentage of data points either set to np.nan or masked, preferably with the same shape and the same dask chunks.
Answer 1:
The suggestion by user545424 is an excellent start. To avoid running into memory issues, put it in a small user-defined function and map it over the DataArray with apply_ufunc.
import xarray as xr
import numpy as np

testdata = xr.DataArray(np.empty((100, 1000, 1000)), dims=['x', 'y', 'z'])

def set_random_fraction_to_nan(data):
    # One uniform draw per element; the threshold is the fraction masked.
    data[np.random.rand(*data.shape) < .8] = np.nan
    return data

# Set 80% of data randomly to nan
testdata = xr.apply_ufunc(set_random_fraction_to_nan, testdata,
                          input_core_dims=[['x', 'y', 'z']],
                          output_core_dims=[['x', 'y', 'z']],
                          dask='parallelized')
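Since the masking is elementwise, a variant for a chunked dask-backed array like the one in the question can simply omit the core dims, so each chunk is processed independently. This is a hedged sketch, not the answer's exact code: the shape and chunks below only mirror the question's data, and output_dtypes is assumed to be required by dask='parallelized' on some xarray versions.

import dask.array as da

# Illustrative dask-backed array mirroring the question's shape and chunking.
lazy = xr.DataArray(da.random.random((8, 228, 721, 1440),
                                     chunks=(1, 6, 721, 1440)),
                    dims=['variable', 'time', 'latitude', 'longitude'])

# No core dims: the elementwise function runs once per chunk, lazily.
masked = xr.apply_ufunc(set_random_fraction_to_nan, lazy,
                        dask='parallelized',
                        output_dtypes=[lazy.dtype])
# masked keeps the same shape, dims and chunks; .compute() evaluates it.

Each chunk draws its own random numbers, so every point still has the same masking probability, although the realized count varies slightly from chunk to chunk.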
For more explanation of how to wrap custom functions to work with xarray, see here.
Source: https://stackoverflow.com/questions/56257429/randomly-mask-set-nan-x-of-data-points-in-huge-xarray-dataarray