Question
I have some experimental data that exists like so:
x = array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1, ...])
y = array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75, ...])
z = array([10, 4, 1, 4, 5, 0, 1, ...])
If it's convenient, we can assume that the data exists as a 3D array or even a pandas DataFrame:
df = pd.DataFrame({'x': x, 'y': y, 'z': z})
The interpretation being, for every position x[i], y[i], the value of some variable is z[i]. These are not evenly sampled, so there will be some parts that are "densely sampled" (e.g. between 1 and 1.2 in x) and others that are very sparse (e.g. between 2 and 3 in x). Because of this, I can't just chuck these into a pcolormesh or contourf.
What I would like to do instead is to resample x and y evenly at some fixed interval and then aggregate the values of z. For my needs, z can be summed or averaged to get meaningful values, so this is not a problem. My naïve attempt was like this:
X = np.arange(min(x), max(x), 0.1)
Y = np.arange(min(y), max(y), 0.1)
x_g, y_g = np.meshgrid(X, Y)
nx, ny = x_g.shape
z_g = np.full(x_g.shape, np.nan)

for ix in range(nx - 1):
    for jx in range(ny - 1):
        x_min = x_g[ix, jx]
        x_max = x_g[ix + 1, jx + 1]
        y_min = y_g[ix, jx]
        y_max = y_g[ix + 1, jx + 1]
        vals = df[(df.x >= x_min) & (df.x < x_max) &
                  (df.y >= y_min) & (df.y < y_max)].z.values
        if vals.any():
            z_g[ix, jx] = sum(vals)
This works and I get the output I desire with plt.contourf(x_g, y_g, z_g), but it is SLOW! I have ~20k samples, which I then resample onto a grid of ~800 bins in x and ~500 in y, meaning the loop body runs 400k times.
Is there any way to vectorize/optimize this? Even better if there is some function that already does this!
(Also tagging this as MATLAB because the numpy and MATLAB syntaxes are very similar and I have access to both.)
Answer 1:
Here's a vectorized Python solution that uses NumPy broadcasting to build bin-membership masks, and matrix multiplication with np.dot for the sum-reduction part:
x_mask = ((x >= X[:-1,None]) & (x < X[1:,None]))
y_mask = ((y >= Y[:-1,None]) & (y < Y[1:,None]))
z_g_out = np.dot(y_mask*z[None].astype(np.float32), x_mask.T)
# If needed to fill invalid places with NaNs
z_g_out[y_mask.dot(x_mask.T.astype(np.float32))==0] = np.nan
Note that we avoid meshgrid here: the meshes it would create are huge, so skipping them saves memory and should also improve performance.
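To make the broadcasting step concrete, here is a minimal sketch (with made-up toy values) of how comparing the samples against the bin edges yields a per-bin membership mask:

```python
import numpy as np

# Toy data: 3 samples, bin edges at 0, 1, 2 (two bins)
x = np.array([0.2, 1.5, 0.9])
X = np.array([0.0, 1.0, 2.0])

# Broadcasting a (2, 1) column of edges against the (3,) samples
# gives a (2, 3) mask; row i is True where a sample falls in
# the half-open bin [X[i], X[i+1])
x_mask = (x >= X[:-1, None]) & (x < X[1:, None])
print(x_mask)
# [[ True False  True]
#  [False  True False]]
```

Multiplying such a mask by z (and dotting two masks together) then sums z over every (y-bin, x-bin) pair in one shot.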
Benchmarking
# Original approach
def org_app(x, y, z):
    X = np.arange(min(x), max(x), 0.1)
    Y = np.arange(min(y), max(y), 0.1)
    x_g, y_g = np.meshgrid(X, Y)
    nx, ny = x_g.shape
    z_g = np.full(np.asarray(x_g.shape) - 1, np.nan)
    for ix in range(nx - 1):
        for jx in range(ny - 1):
            x_min = x_g[ix, jx]
            x_max = x_g[ix + 1, jx + 1]
            y_min = y_g[ix, jx]
            y_max = y_g[ix + 1, jx + 1]
            vals = z[(x >= x_min) & (x < x_max) &
                     (y >= y_min) & (y < y_max)]
            if vals.any():
                z_g[ix, jx] = sum(vals)
    return z_g

# Proposed approach
def app1(x, y, z):
    X = np.arange(min(x), max(x), 0.1)
    Y = np.arange(min(y), max(y), 0.1)
    x_mask = (x >= X[:-1, None]) & (x < X[1:, None])
    y_mask = (y >= Y[:-1, None]) & (y < Y[1:, None])
    z_g_out = np.dot(y_mask * z[None].astype(np.float32), x_mask.T)
    # If needed, fill bins that received no samples with NaNs
    z_g_out[y_mask.dot(x_mask.T.astype(np.float32)) == 0] = np.nan
    return z_g_out
Note that for a fair benchmark, the original approach here also works on plain arrays, as fetching values from a DataFrame could slow things down.
Timings and verification -
In [143]: x = np.array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1])
...: y = np.array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75])
...: z = np.array([10, 4, 1, 4, 5, 0, 1])
...:
# Verify outputs
In [150]: np.nansum(np.abs(org_app(x,y,z) - app1(x,y,z)))
Out[150]: 0.0
In [145]: %timeit org_app(x,y,z)
10 loops, best of 3: 19.9 ms per loop
In [146]: %timeit app1(x,y,z)
10000 loops, best of 3: 39.1 µs per loop
In [147]: 19900/39.1 # Speedup figure
Out[147]: 508.95140664961633
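As an aside, since the question asks whether a ready-made function exists: np.histogram2d with its weights argument performs exactly this kind of bin-and-sum aggregation. A minimal sketch using the sample data (note one small difference from the half-open bins in the loop above: histogram2d treats the last bin as closed on the right):

```python
import numpy as np

x = np.array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1])
y = np.array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75])
z = np.array([10, 4, 1, 4, 5, 0, 1], dtype=float)

X = np.arange(x.min(), x.max(), 0.1)
Y = np.arange(y.min(), y.max(), 0.1)

# Weighted 2D histogram: each sample contributes its z to its (x, y) bin
z_sum, _, _ = np.histogram2d(x, y, bins=[X, Y], weights=z)

# Counts per bin, used to mask empty bins (or to average instead of sum)
counts, _, _ = np.histogram2d(x, y, bins=[X, Y])
z_sum[counts == 0] = np.nan
```

z_sum has shape (len(X)-1, len(Y)-1) with x along axis 0, so transpose it before handing it to plt.contourf if y should index the rows.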
Answer 2:
Here is a MATLAB solution:
X = min(x)-1 :.1:max(x)+1; % the grid needs to be expanded slightly beyond the min and max
Y = min(y)-1 :.1:max(y)+1;
x_o = interp1(X, 1:numel(X), x, 'nearest');
y_o = interp1(Y, 1:numel(Y), y, 'nearest');
z_g = accumarray([x_o(:) y_o(:)], z(:),[numel(X) numel(Y)]);
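For reference, a rough NumPy translation of this accumarray approach might look like the following. Note this is a sketch: np.digitize assigns each sample to the bin whose left edge precedes it, which differs slightly from interp1's 'nearest' rounding.

```python
import numpy as np

x = np.array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1])
y = np.array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75])
z = np.array([10, 4, 1, 4, 5, 0, 1], dtype=float)

# Expand the grid slightly beyond the data, as in the MATLAB version
X = np.arange(x.min() - 1, x.max() + 1, 0.1)
Y = np.arange(y.min() - 1, y.max() + 1, 0.1)

# Bin index of each sample (digitize returns 1-based insertion points)
x_o = np.digitize(x, X) - 1
y_o = np.digitize(y, Y) - 1

# Unbuffered in-place accumulation: the NumPy analogue of accumarray
z_g = np.zeros((X.size, Y.size))
np.add.at(z_g, (x_o, y_o), z)
```

As with the MATLAB output, z_g is indexed (x-bin, y-bin), so transpose it for plotting functions that expect y along the first axis.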
Source: https://stackoverflow.com/questions/45777934/creating-a-heatmap-by-sampling-and-bucketing-from-a-3d-array