问题
Forgive me for a vague title. I honestly don't know which title will suit this question. If you have a better title, let's change it so that it will be apt for the problem at hand.
The problem.
Let's say result
is a 2D array and values
is a 1D array. values
holds some values associated with each element in result
. The mapping of an element in values
to result
is stored in x_mapping
and y_mapping
. A position in result
can be associated with different values. Now, I have to find the sum of the values grouped by associations.
An example for better clarification.
result
array:
[[0, 0],
[0, 0],
[0, 0],
[0, 0]]
values
array:
[ 1., 2., 3., 4., 5., 6., 7., 8.]
Note: Here result
and values
have the same number of elements. But it might not be the case. There is no relation between the sizes at all.
x_mapping
and y_mapping
have mappings from 1D values
to 2D result
. The sizes of x_mapping
, y_mapping
and values
will be the same.
x_mapping
- [0, 1, 0, 0, 0, 0, 0, 0]
y_mapping
- [0, 3, 2, 2, 0, 3, 2, 1]
Here, 1st value(values[0]
) have x as 0 and y as 0(x_mapping[0]
and y_mappping[0]
) and hence associated with result[0, 0]
. If we are counting the number of associations, then element value at result[0,0]
will be 2 as 1st value and 5th value are associated with result[0, 0]
. If we are taking the sum, the result[0, 0] = value[0] + value[4]
which is 6.
Current solution
# Initialisation. No connection with the solution.
result = np.zeros([4,2], dtype=np.int16)
values = np.linspace(start=1, stop=8, num=8)
y_mapping = np.random.randint(low=0, high=values.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=values.shape[1], size=values.shape[0])
# Summing the values associated with x,y (current solution.)
for i in range(values.size):
x = x_mapping[i]
y = y_mapping[i]
result[-y, x] = result[-y, x] + values[i]
The result
,
[[6, 0],
[ 6, 2],
[14, 0],
[ 8, 0]]
Failed solution; But why?
test_result = np.zeros_like(result)
test_result[-y_mapping, x_mapping] = test_result[-y_mapping, x_mapping] + values # solution
To my surprise elements are overwritten in test_result
. Values at test_result
,
[[5, 0],
[6, 2],
[7, 0],
[8, 0]]
Question
1. Why, in the second solution, every element is overwritten?
As @Divakar has pointed out in the comment in his answer -
NumPy doesn't assign accumulated/summed values when the indices are repeated in test_result[-y_mapping, x_mapping] =
. It randomly assigns from one of the instances.
2. Is there any Numpy way to do this? That is without looping? I'm looking for some speed optimization.
Approach #2 in @Divakar's answer gives me good results. For 23315 associations, for
loop took 50 ms while Approach #1 took 1.85 ms. Beating all these, Approach #2 took 668 µs.
Side note
I'm using Numpy version 1.14.3 with Python 3.5.2 on an i7 processor.
回答1:
Approach #1
Most intutive one would be with np.add.at
for those repeated indices -
np.add.at(result, [-y_mapping, x_mapping], values)
Approach #2
We need to perform binned summations owing to the possible repeated nature of x,y indices. Hence, another way could be to use NumPy's binned summation func : np.bincount
and have an implementation like so -
# Get linear index equivalents off the x and y indices into result array
m,n = result.shape
out_dtype = result.dtype
lidx = ((-y_mapping)%m)*n + x_mapping
# Get binned summations off values based on linear index as bins
binned_sums = np.bincount(lidx, values, minlength=m*n)
# Finally add into result array
result += binned_sums.astype(result.dtype).reshape(m,n)
If you are always starting off with a zeros array for result
, the last step could be made more performant with -
result = binned_sums.astype(out_dtype).reshape(m,n)
回答2:
I guess you were to write
y_mapping = np.random.randint(low=0, high=result.shape[0], size=values.shape[0])
x_mapping = np.random.randint(low=0, high=result.shape[1], size=values.shape[0])
With that correction, the code works for me as expected.
来源:https://stackoverflow.com/questions/51691980/numpy-grouping-binning-values-based-on-associations