Creating a heatmap by sampling and bucketing from a 3D array

Submitted by 社会主义新天地 on 2019-12-22 09:46:55

Question


I have some experimental data that exists like so:

x = array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1, ...])
y = array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75, ...])
z = array([10, 4, 1, 4, 5, 0, 1, ...])

If it's convenient, we can assume that the data exists as three parallel 1D arrays or as columns of a pandas DataFrame:

df = pd.DataFrame({'x': x, 'y': y, 'z': z})

The interpretation being, for every position x[i], y[i], the value of some variable is z[i]. These are not evenly sampled, so there will be some parts that are "densely sampled" (e.g. between 1 and 1.2 in x) and others that are very sparse (e.g. between 2 and 3 in x). Because of this, I can't just chuck these into a pcolormesh or contourf.

What I would like to do instead is to resample x and y evenly at some fixed interval and then aggregate the values of z. For my needs, z can be summed or averaged to get meaningful values, so this is not a problem. My naïve attempt was like this:

X = np.arange(min(x), max(x), 0.1)  
Y = np.arange(min(y), max(y), 0.1)
x_g, y_g = np.meshgrid(X, Y)
nx, ny = x_g.shape
z_g = np.full(x_g.shape, np.nan)

for ix in range(nx - 1):
    for jx in range(ny - 1):
        x_min = x_g[ix, jx]
        x_max = x_g[ix + 1, jx + 1]
        y_min = y_g[ix, jx]
        y_max = y_g[ix + 1, jx + 1]
        vals = df[(df.x >= x_min) & (df.x < x_max) & 
                  (df.y >= y_min) & (df.y < y_max)].z.values
        if vals.size:  # vals.any() would wrongly skip buckets whose z values are all zero
            z_g[ix, jx] = vals.sum()

This works, and I get the output I desire with plt.contourf(x_g, y_g, z_g), but it is SLOW! I have ~20k samples, which I then bin into ~800 intervals in x and ~500 in y, meaning the for loop runs 400k times.

Is there any way to vectorize/optimize this? Even better if there is some function that already does this!

(Also tagging this as MATLAB, because NumPy and MATLAB syntax are very similar and I have access to both.)
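(Editorial aside on the "function that already does this" part: NumPy's np.histogram2d accepts a weights argument, which sums z per bin directly. A minimal sketch, reusing the sample data and bin edges from the question; note that histogram2d puts x along the first axis, so a transpose may be needed before contourf:)

```python
import numpy as np

x = np.array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1])
y = np.array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75])
z = np.array([10, 4, 1, 4, 5, 0, 1], dtype=float)

X = np.arange(min(x), max(x), 0.1)  # bin edges, as in the question
Y = np.arange(min(y), max(y), 0.1)

# Sum of z per (x, y) bin; result has shape (len(X)-1, len(Y)-1)
z_sum, _, _ = np.histogram2d(x, y, bins=[X, Y], weights=z)

# For a mean instead of a sum, divide by the unweighted per-bin counts
counts, _, _ = np.histogram2d(x, y, bins=[X, Y])
with np.errstate(invalid='ignore'):
    z_mean = z_sum / counts  # NaN where a bin is empty
```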


Answer 1:


Here's a vectorized Python solution employing NumPy broadcasting and matrix multiplication with np.dot for the sum-reduction part -

# Bin-membership masks, shape (num_bins, num_samples)
x_mask = ((x >= X[:-1,None]) & (x < X[1:,None]))
y_mask = ((y >= Y[:-1,None]) & (y < Y[1:,None]))

# Sum z over each (y-bin, x-bin) cell in one matrix multiplication
z_g_out = np.dot(y_mask*z[None].astype(np.float32), x_mask.T)

# If needed to fill invalid places with NaNs
z_g_out[y_mask.dot(x_mask.T.astype(np.float32))==0] = np.nan

Note that this avoids meshgrid entirely, which saves memory (the meshes created by meshgrid would be huge) and should help performance as well.
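To make the broadcasting concrete, here is a tiny worked example (with made-up values, not the question's data) showing the mask shapes and the result of the np.dot reduction:

```python
import numpy as np

# Three samples, two bins per axis (hypothetical values for illustration)
x = np.array([0.5, 1.5, 1.7])
y = np.array([0.2, 0.3, 1.5])
z = np.array([10.0, 4.0, 1.0])
X = np.array([0.0, 1.0, 2.0])  # x bin edges
Y = np.array([0.0, 1.0, 2.0])  # y bin edges

# Each mask row flags the samples that fall into one bin:
# shape (num_bins, num_samples)
x_mask = (x >= X[:-1, None]) & (x < X[1:, None])
y_mask = (y >= Y[:-1, None]) & (y < Y[1:, None])

# (y_mask * z) weights each sample by z; the dot with x_mask.T then
# sums the weights per (y-bin, x-bin) cell
z_g = (y_mask * z).dot(x_mask.T)
# z_g == [[10., 4.],
#         [ 0., 1.]]
```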

Benchmarking

# Original app
def org_app(x,y,z):    
    X = np.arange(min(x), max(x), 0.1)  
    Y = np.arange(min(y), max(y), 0.1)
    x_g, y_g = np.meshgrid(X, Y)
    nx, ny = x_g.shape
    z_g = np.full(np.asarray(x_g.shape)-1, np.nan)

    for ix in range(nx - 1):
        for jx in range(ny - 1):
            x_min = x_g[ix, jx]
            x_max = x_g[ix + 1, jx + 1]
            y_min = y_g[ix, jx]
            y_max = y_g[ix + 1, jx + 1]
            vals = z[(x >= x_min) & (x < x_max) &
                     (y >= y_min) & (y < y_max)]
            if vals.size:  # vals.any() would wrongly skip all-zero buckets
                z_g[ix, jx] = vals.sum()
    return z_g

# Proposed app
def app1(x,y,z):
    X = np.arange(min(x), max(x), 0.1)  
    Y = np.arange(min(y), max(y), 0.1)
    x_mask = ((x >= X[:-1,None]) & (x < X[1:,None]))
    y_mask = ((y >= Y[:-1,None]) & (y < Y[1:,None]))

    z_g_out = np.dot(y_mask*z[None].astype(np.float32), x_mask.T)

    # If needed to fill invalid places with NaNs
    z_g_out[y_mask.dot(x_mask.T.astype(np.float32))==0] = np.nan
    return z_g_out

For a fair benchmark, the original approach is run on plain NumPy arrays, since fetching values from a DataFrame would slow it down further.

Timings and verification -

In [143]: x = np.array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1])
     ...: y = np.array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75])
     ...: z = np.array([10, 4, 1, 4, 5, 0, 1])
     ...: 

# Verify outputs
In [150]: np.nansum(np.abs(org_app(x,y,z) - app1(x,y,z)))
Out[150]: 0.0

In [145]: %timeit org_app(x,y,z)
10 loops, best of 3: 19.9 ms per loop

In [146]: %timeit app1(x,y,z)
10000 loops, best of 3: 39.1 µs per loop

In [147]: 19900/39.1  # Speedup figure
Out[147]: 508.95140664961633



Answer 2:


Here is a MATLAB solution:

X = min(x)-1 :.1:max(x)+1; % the grid needs to be expanded slightly beyond the min and max
Y = min(y)-1 :.1:max(y)+1;
x_o = interp1(X, 1:numel(X), x, 'nearest');
y_o = interp1(Y, 1:numel(Y), y, 'nearest');
z_g = accumarray([x_o(:) y_o(:)], z(:),[numel(X) numel(Y)]);
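For completeness, a rough NumPy analogue of this interp1-to-nearest plus accumarray approach (a sketch, assuming a uniform grid so nearest-node snapping reduces to rounding; np.add.at plays the role of accumarray):

```python
import numpy as np

x = np.array([1, 1.12, 1.109, 2.1, 3, 4.104, 3.1])
y = np.array([-9, -0.1, -9.2, -8.7, -5, -4, -8.75])
z = np.array([10, 4, 1, 4, 5, 0, 1], dtype=float)

step = 0.1
X = np.arange(x.min() - 1, x.max() + 1, step)  # expanded grid, as in the MATLAB answer
Y = np.arange(y.min() - 1, y.max() + 1, step)

# Snap each sample to the nearest grid node (uniform spacing, so rounding works)
ix = np.rint((x - X[0]) / step).astype(int)
iy = np.rint((y - Y[0]) / step).astype(int)

# accumarray equivalent: unbuffered in-place summation per (ix, iy) pair
z_g = np.zeros((len(X), len(Y)))
np.add.at(z_g, (ix, iy), z)
```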


Source: https://stackoverflow.com/questions/45777934/creating-a-heatmap-by-sampling-and-bucketing-from-a-3d-array
