Question
I have a huge (~2 billion data points) xarray.DataArray. I would like to randomly delete (either mask or replace by np.nan) a given percentage of the data, where the probability of every data point being chosen for deletion/masking is the same across all coordinates. I could convert the array to a numpy.array, but I would prefer to keep it in dask chunks for speed.
My data looks like this:
>> data
<xarray.DataArray 'stack-820860ba63bd07adc355885d96354267' (variable: 8, time: 228, latitude: 721, longitude: 1440)>
dask.array<stack, shape=(8, 228, 721, 1440), dtype=float64, chunksize=(1, 6, 721, 1440)>
Coordinates:
* latitude (latitude) float32 90.0 89.75 89.5 89.25 89.0 88.75 88.5 ...
* variable (variable) <U5 u'fal' u'swvl1' u'swvl3' u'e' u'swvl2' u'es'
* longitude (longitude) float32 0.0 0.25 0.5 0.75 1.0 1.25 1.5 1.75 2.0
* time (time) datetime64[ns] 2000-01-01 2000-02-01 2000-03-01 ...
I defined
frac_missing = 0.2
k = int(frac_missing*data.size)
This is what I already tried:
- This solution works with np.ndindex, but the np.ndindex object is converted to a list, which is very slow. I tried circumventing the conversion and simply iterating over the np.ndindex object as described here and here, but iterating over the whole iterator is slow for ~2 billion data points.
- np.random.choice(data.stack(newdim=('latitude','variable','longitude','time')), k, replace=False) returns the desired subset of data points, but does not set them to nan (see the numpy-only sketch after this list for the missing step).
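For reference, on a plain numpy array the exact-k version of that second attempt can be finished with flat indices and np.put; this is only a minimal sketch, with a small illustrative array standing in for the real dask-backed data:

import numpy as np

# Illustrative stand-in for the real data (the real array is dask-backed
# and far larger; shape here is arbitrary).
arr = np.random.rand(8, 228, 72, 144)

frac_missing = 0.2
k = int(frac_missing * arr.size)

# Choose exactly k flat indices, uniformly and without replacement,
# then write np.nan into the original array at those positions.
flat_idx = np.random.choice(arr.size, size=k, replace=False)
np.put(arr, flat_idx, np.nan)

assert np.isnan(arr).sum() == k  # exactly k points masked

Note that drawing 20% of ~2 billion indices without replacement this way would itself be memory-heavy, which is why the accepted answer below works per element instead.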
The expected output would be the xarray.DataArray with the given percentage of data points either set to np.nan or masked, preferably with the same shape and the same dask chunks.
Answer 1:
The suggestion by user545424 is an excellent start. To avoid running into memory issues, put it in a small user-defined function and map it over the DataArray with apply_ufunc.
import xarray as xr
import numpy as np

testdata = xr.DataArray(np.empty((100, 1000, 1000)), dims=['x', 'y', 'z'])

def set_random_fraction_to_nan(data):
    # One uniform draw per element; the threshold is the fraction masked.
    data[np.random.rand(*data.shape) < .8] = np.nan
    return data

# Set 80% of data randomly to nan
testdata = xr.apply_ufunc(set_random_fraction_to_nan, testdata,
                          input_core_dims=[['x', 'y', 'z']],
                          output_core_dims=[['x', 'y', 'z']],
                          dask='parallelized')
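Since the masking is elementwise, a variant for a chunked dask-backed array like the one in the question can simply omit the core dims, so each chunk is processed independently. This is a hedged sketch, not the answer's exact code: the shape and chunks below only mirror the question's data, and output_dtypes is assumed to be required by dask='parallelized' on some xarray versions.

import dask.array as da

# Illustrative dask-backed array mirroring the question's shape and chunking.
lazy = xr.DataArray(da.random.random((8, 228, 721, 1440),
                                     chunks=(1, 6, 721, 1440)),
                    dims=['variable', 'time', 'latitude', 'longitude'])

# No core dims: the elementwise function runs once per chunk, lazily.
masked = xr.apply_ufunc(set_random_fraction_to_nan, lazy,
                        dask='parallelized',
                        output_dtypes=[lazy.dtype])
# masked keeps the same shape, dims and chunks; .compute() evaluates it.

Each chunk draws its own random numbers, so every point still has the same masking probability, although the realized count varies slightly from chunk to chunk.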
For more explanation of how to wrap custom functions to work with xarray, see here.
Source: https://stackoverflow.com/questions/56257429/randomly-mask-set-nan-x-of-data-points-in-huge-xarray-dataarray