Question
Here is what I am trying to do with NumPy in Python 2.7. Suppose I have an array a defined by the following:
a = np.array([[1,3,3],[4,5,6],[7,8,1]])
I can do a.argmax(0) or a.argmax(1) to get the argmax along each column or each row, respectively:
a.argmax(0)
Out[329]: array([2, 2, 1], dtype=int64)
a.argmax(1)
Out[330]: array([1, 2, 1], dtype=int64)
However, when there is a tie, as in the first row of a, I would like the argmax to be chosen randomly among the tied entries (by default, NumPy returns the first element whenever a tie occurs in argmax or argmin).
Last year, someone posted a question about resolving NumPy argmax/argmin ties randomly: Select One Element in Each Row of a Numpy Array by Column Indices
However, that question was aimed at one-dimensional arrays, and its most-voted answer works well for that case. A second answer attempts to solve the problem for multidimensional arrays as well but doesn't work, i.e. it does not return, for each row/column, the index of the maximum value with ties resolved randomly.
What would be the most performant way to do this, since I am working with big arrays?
Answer 1:
Generic case solution to pick one per group
To solve the general case of picking one random number per group, given a list/array of numbers that specify the range sizes for the picks, we would use a trick: create a uniform random array, add a per-group offset so that each group's values fall in their own unit interval, and then perform argsort. The implementation would look something like this -
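import numpy as np

def random_num_per_grp(L):
    # For each element in L pick a random number within range specified by it
    # Group i gets random keys in [i, i+1), so argsort keeps groups contiguous
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)),L)
    # Starting position of each group in the concatenated array
    offset = np.r_[0,np.cumsum(L[:-1])]
    return r1.argsort()[offset] - offset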
Sample case -
In [217]: L = [5,4,2]
In [218]: random_num_per_grp(L) # i.e. select one per [0-5,0-4,0-2]
Out[218]: array([2, 0, 1])
So, the output would have the same number of elements as the input L, and the first output element would be in [0,5), the second in [0,4), and so on.
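To see why the argsort trick works, the sketch below names each intermediate array for L = [5, 4, 2]. It is only an illustration: the function name pick_one_per_group, the explicit rng argument and the seed are assumptions introduced here, not part of the answer's code.

```python
import numpy as np

def pick_one_per_group(L, rng):
    """Return one uniformly random in-group index per group; L holds group sizes (all >= 1)."""
    L = np.asarray(L)
    group_id = np.repeat(np.arange(len(L)), L)      # e.g. [0 0 0 0 0 1 1 1 1 2 2]
    keys = rng.random_sample(L.sum()) + group_id    # group i's keys lie in [i, i+1)
    order = keys.argsort()                          # groups stay contiguous, shuffled inside
    offset = np.r_[0, np.cumsum(L[:-1])]            # start of each group, e.g. [0 5 9]
    return order[offset] - offset                   # first shuffled member, as in-group index

rng = np.random.RandomState(0)  # hypothetical seed, only for repeatability
L = [5, 4, 2]
picks = pick_one_per_group(L, rng)
print(picks)  # three values, in [0,5), [0,4) and [0,2) respectively
```

Because every key in group i is smaller than every key in group i+1, sorting never mixes groups; it only shuffles members within each group, so taking the first sorted member of a group is a uniform draw from that group.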
Solving our problem here
To solve our case here, we would use a modified version (specifically, one that drops the offset subtraction at the end of the function), like so -
def random_num_per_grp_cumsumed(L):
    # For each element in L pick a random number within range specified by it
    # The final output would be a cumsumed one for use with indexing, etc.
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)),L)
    offset = np.r_[0,np.cumsum(L[:-1])]
    return r1.argsort()[offset]
Approach #1
One solution could use it like so -
def argmax_per_row_randtie(a):
    max_mask = a==a.max(1,keepdims=1)
    m,n = a.shape
    all_argmax_idx = np.flatnonzero(max_mask)
    offset = np.arange(m)*n
    return all_argmax_idx[random_num_per_grp_cumsumed(max_mask.sum(1))] - offset
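For the sample array, the deterministic intermediate values of Approach #1 can be inspected as follows. This is a sketch for illustration only; the hard-coded picked array stands in for one possible result of the random selection step and is an assumption.

```python
import numpy as np

a = np.array([[1, 3, 3], [4, 5, 6], [7, 8, 1]])
m, n = a.shape

max_mask = a == a.max(1, keepdims=True)      # True wherever a row attains its maximum
all_argmax_idx = np.flatnonzero(max_mask)    # flat positions of every maximum
counts = max_mask.sum(1)                     # number of tied maxima per row
offset = np.arange(m) * n                    # flat index at which each row starts

print(all_argmax_idx)   # [1 2 5 7]
print(counts)           # [2 1 1]
print(offset)           # [0 3 6]

# One possible draw: positions into all_argmax_idx, one per row.
# Row 0 may pick position 0 or 1; rows 1 and 2 have a single candidate each.
picked = np.array([1, 2, 3])                 # assumed draw, for illustration only
print(all_argmax_idx[picked] - offset)       # [2 2 1]
```

Subtracting the row start converts each flat position back into a column index within its row.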
Verification
Let's test it on the given sample with a large number of runs and count the number of occurrences of each index for each row
In [235]: a
Out[235]:
array([[1, 3, 3],
       [4, 5, 6],
       [7, 8, 1]])
In [225]: all_out = np.array([argmax_per_row_randtie(a) for i in range(10000)])
# The first element (row=0) should have similar probabilities for 1 and 2
In [236]: (all_out[:,0]==1).mean()
Out[236]: 0.504
In [237]: (all_out[:,0]==2).mean()
Out[237]: 0.496
# The second element (row=1) should only have 2
In [238]: (all_out[:,1]==2).mean()
Out[238]: 1.0
# The third element (row=2) should only have 1
In [239]: (all_out[:,2]==1).mean()
Out[239]: 1.0
Approach #2 : Use masking for performance
We could make use of masking and hence avoid that flatnonzero, with the intention of gaining performance, since working with boolean arrays is generally efficient. Also, we would generalize to cover both rows (axis=1) and columns (axis=0) to give ourselves a modified version, like so -
def argmax_randtie_masking_generic(a, axis=1):
    max_mask = a==a.max(axis=axis,keepdims=True)
    m,n = a.shape
    L = max_mask.sum(axis=axis)
    set_mask = np.zeros(L.sum(), dtype=bool)
    select_idx = random_num_per_grp_cumsumed(L)
    set_mask[select_idx] = True
    # Keep only the randomly selected maximum in each row/column
    if axis==0:
        max_mask.T[max_mask.T] = set_mask
    else:
        max_mask[max_mask] = set_mask
    return max_mask.argmax(axis=axis)
Sample runs on axis=0 and axis=1 -
In [423]: a
Out[423]:
array([[1, 3, 3],
       [4, 5, 6],
       [7, 8, 1]])
In [424]: argmax_randtie_masking_generic(a, axis=1)
Out[424]: array([1, 2, 1])
In [425]: argmax_randtie_masking_generic(a, axis=1)
Out[425]: array([2, 2, 1])
In [426]: a[1,1] = 8
In [427]: a
Out[427]:
array([[1, 3, 3],
       [4, 8, 6],
       [7, 8, 1]])
In [428]: argmax_randtie_masking_generic(a, axis=0)
Out[428]: array([2, 1, 1])
In [429]: argmax_randtie_masking_generic(a, axis=0)
Out[429]: array([2, 1, 1])
In [430]: argmax_randtie_masking_generic(a, axis=0)
Out[430]: array([2, 2, 1])
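A quick way to check outputs like these is to confirm that every returned index actually points at a maximum along the chosen axis. The helper below is a hypothetical addition (the name is_valid_argmax is not from the answer) and assumes NumPy 1.15 or later for np.take_along_axis.

```python
import numpy as np

def is_valid_argmax(a, idx, axis):
    """True if idx selects a maximum of a along axis for every slice."""
    picked = np.take_along_axis(a, np.expand_dims(idx, axis), axis)
    return bool((picked == a.max(axis=axis, keepdims=True)).all())

a = np.array([[1, 3, 3], [4, 8, 6], [7, 8, 1]])
print(is_valid_argmax(a, np.array([2, 1, 1]), axis=0))  # True
print(is_valid_argmax(a, np.array([2, 2, 1]), axis=0))  # True
print(is_valid_argmax(a, np.array([0, 1, 1]), axis=0))  # False
```

Both tie-broken results shown in the sample runs pass, while an index that points at a non-maximal entry is rejected.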
Answer 2:
A simple way is to add a small random number to all the values at the start, so your data would be like this:
a = np.array([[1.1827,3.1734,3.9187],[4.8172,5.7101,6.9182],[7.1834,8.5012,1.9818]])
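That can be done with a = a + np.random.random(a.shape).
If you later need to get the original values back, you can use a.astype(int) to drop the fractional parts. Note that this relies on the original values being integers: noise in [0, 1) can then only reorder tied entries. Truncation recovers the originals as long as they are non-negative; for negative integers, np.floor(a) would be needed instead, since astype(int) truncates toward zero.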
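A minimal sketch of this jitter idea that leaves the original array untouched, so no conversion back is needed. It assumes integer-valued input (distinct values differ by at least 1, so noise in [0, 1) can only reorder ties); the function name argmax_jitter and the rng parameter are assumptions for illustration.

```python
import numpy as np

def argmax_jitter(a, axis, rng=np.random):
    """Argmax along axis with ties broken at random; integer-valued input assumed."""
    noise = rng.random_sample(a.shape)   # uniform in [0, 1)
    return (a + noise).argmax(axis)      # a itself is not modified

rng = np.random.RandomState(0)           # hypothetical seed for repeatability
a = np.array([[1, 3, 3], [4, 5, 6], [7, 8, 1]])
print(argmax_jitter(a, axis=1, rng=rng))  # first entry is 1 or 2; the others are 2 and 1
```

The cost is one temporary float array of the same shape as a, which may matter for very large inputs.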
Answer 3:
You could use an array of random numbers, the same shape as your input, but mask out the array to only leave the candidates for selection.
import numpy as np

def rndArgMax(a, axis):
    a_max = a.max(axis, keepdims=True)
    # Random weights survive only where a equals the maximum along axis
    tmp = np.random.random(a.shape) * (a == a_max)
    return tmp.argmax(axis)

a = np.random.randint(0, 3, size=(2, 3, 4))
print(rndArgMax(a, 1))
# Example output (varies between runs):
# [[1 1 2 1]
#  [0 1 1 1]]
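One way to sanity-check this approach is to stack many copies of a tied row and confirm that the tied positions are chosen about equally often. The sketch below does that with a variant that accepts a seeded generator; the name rnd_argmax_seeded and the seed are assumptions added for repeatability.

```python
import numpy as np

def rnd_argmax_seeded(a, axis, rng):
    """Same idea as rndArgMax, but drawing from an explicit RandomState."""
    a_max = a.max(axis, keepdims=True)
    tmp = rng.random_sample(a.shape) * (a == a_max)   # non-maxima are zeroed out
    return tmp.argmax(axis)

rng = np.random.RandomState(0)                 # hypothetical seed
rows = np.tile([1, 3, 3], (10000, 1))          # 10000 copies of a row with a tie
idx = rnd_argmax_seeded(rows, axis=1, rng=rng)

print((idx == 0).mean())   # 0.0, since column 0 never holds the maximum
print((idx == 1).mean())   # close to 0.5
print((idx == 2).mean())   # close to 0.5
```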
Source: https://stackoverflow.com/questions/51914697/numpy-arrays-row-column-wise-argmax-with-random-ties