Numpy arrays: row/column wise argmax with random ties

Question

Here is what I am trying to do with Numpy in Python 2.7. Suppose I have an array a defined by the following:

a = np.array([[1,3,3],[4,5,6],[7,8,1]])

I can do a.argmax(0) or a.argmax(1) to get the row/column wise argmax:

a.argmax(0)
Out[329]: array([2, 2, 1], dtype=int64)
a.argmax(1)
Out[330]: array([1, 2, 1], dtype=int64)

However, when there is a tie like in a's first row, I would like to get the argmax decided randomly between the ties (by default, Numpy returns the first element whenever a tie occurs in argmax or argmin).
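
To make the default behaviour concrete, here is a minimal sketch (variable names are illustrative, not from the original post) showing that argmax reports only the first of several tied maxima, while np.flatnonzero lists all tied positions in a row:

```python
import numpy as np

a = np.array([[1, 3, 3], [4, 5, 6], [7, 8, 1]])

# argmax along axis=1 reports the first maximum found in each row
first_idx = a.argmax(1)

# All column positions in row 0 that hold the row maximum
tied = np.flatnonzero(a[0] == a[0].max())

print(first_idx)  # [1 2 1]
print(tied)       # [1 2]
```

The goal of the question is to return either 1 or 2 for row 0, each with equal probability.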

Last year, someone put a question on solving Numpy argmax/argmin ties randomly: Select One Element in Each Row of a Numpy Array by Column Indices

However, that question was aimed at one-dimensional arrays, and the most-voted answer works well for that case. A second answer attempts to handle multidimensional arrays as well, but it doesn't work: it does not return, for each row/column, the index of the maximum value with ties broken randomly.

What would be the most performant way to do that, since I am working with big arrays?

Answer 1:

Generic case solution to pick one per group

To solve the general case of picking one random number per group, where a list/array of numbers specifies the range of each pick, we can use a trick: create a uniform random array, add to each element an integer offset identifying its group (so that groups stay in order when sorted), and then perform argsort. The implementation would look something like this -

def random_num_per_grp(L):
    # For each element in L pick a random number within range specified by it
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)),L)
    offset = np.r_[0,np.cumsum(L[:-1])]
    return r1.argsort()[offset] - offset

Sample case -

In [217]: L = [5,4,2]

In [218]: random_num_per_grp(L) # i.e. select one index from each of [0,5), [0,4), [0,2)
Out[218]: array([2, 0, 1])

So, the output would have same number of elements as in input L and the first output element would be in [0,5), second in [0,4) and so on.
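
As a rough check of the trick, the sketch below restates the same function with each step annotated and then samples it many times. The names grp_id, picks, in_range and freq_first are illustrative, and the run count of 2000 is an arbitrary choice:

```python
import numpy as np

def random_num_per_grp(L):
    L = np.asarray(L)
    # Group id for every slot: [0,0,0,0,0,1,1,1,1,2,2] for L = [5,4,2]
    grp_id = np.repeat(np.arange(len(L)), L)
    # Uniform noise in [0,1) plus the group id keeps groups in order when sorted
    r1 = np.random.rand(L.sum()) + grp_id
    # Start position of each group in the concatenated layout
    offset = np.r_[0, np.cumsum(L[:-1])]
    # The first sorted slot of each group is a uniformly random member of it
    return r1.argsort()[offset] - offset

L = [5, 4, 2]
picks = np.array([random_num_per_grp(L) for _ in range(2000)])

# Every pick should stay inside its own range
in_range = bool(((picks >= 0) & (picks < np.array(L))).all())
# Empirical frequency of each value for the first group (expected near 0.2 each)
freq_first = np.bincount(picks[:, 0], minlength=5) / len(picks)
print(in_range, freq_first)
```

Note that the function assumes every entry of L is at least 1; a group of size zero would have no slot to pick from.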

Solving our problem here

To solve our case here, we would use a modified version (specifically, one that drops the offset subtraction at the end of the function), like so -

def random_num_per_grp_cumsumed(L):
    # For each element in L pick a random number within range specified by it
    # The final output would be a cumsumed one for use with indexing, etc.
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)),L)
    offset = np.r_[0,np.cumsum(L[:-1])]
    return r1.argsort()[offset]
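
To illustrate what the un-subtracted output means, this small sketch feeds in the tie counts of the sample array from the question (two maxima in row 0, one each in rows 1 and 2). The names counts and flat_pick are illustrative:

```python
import numpy as np

def random_num_per_grp_cumsumed(L):
    L = np.asarray(L)
    r1 = np.random.rand(L.sum()) + np.repeat(np.arange(len(L)), L)
    offset = np.r_[0, np.cumsum(L[:-1])]
    return r1.argsort()[offset]

# Tie counts per row: row 0 has two maxima, rows 1 and 2 have one each
counts = np.array([2, 1, 1])
flat_pick = random_num_per_grp_cumsumed(counts)

# Group 0 owns flat slots {0, 1}, group 1 owns slot 2, group 2 owns slot 3
print(flat_pick)
```

The returned values index directly into the concatenated list of all maxima, which is what makes them usable for indexing in the approaches that follow.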

Approach #1

One solution could use it like so -

def argmax_per_row_randtie(a):
    max_mask = a==a.max(1,keepdims=True)
    m,n = a.shape
    all_argmax_idx = np.flatnonzero(max_mask)
    offset = np.arange(m)*n
    return all_argmax_idx[random_num_per_grp_cumsumed(max_mask.sum(1))] - offset

Verification

Let's test on the given sample with a large number of runs and count the number of occurrences of each index for each row

In [235]: a
Out[235]: 
array([[1, 3, 3],
       [4, 5, 6],
       [7, 8, 1]])

In [225]: all_out = np.array([argmax_per_row_randtie(a) for i in range(10000)])

# The first element (row=0) should have similar probabilities for 1 and 2
In [236]: (all_out[:,0]==1).mean()
Out[236]: 0.504

In [237]: (all_out[:,0]==2).mean()
Out[237]: 0.496

# The second element (row=1) should only have 2
In [238]: (all_out[:,1]==2).mean()
Out[238]: 1.0

# The third element (row=2) should only have 1
In [239]: (all_out[:,2]==1).mean()
Out[239]: 1.0
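
Approach #1 as written handles rows only. One possible way to reuse it for columns, assuming a 2D input, is to pass the transpose. The wrapper argmax_per_col_randtie and the test array b below are hypothetical additions for illustration:

```python
import numpy as np

def random_num_per_grp_cumsumed(L):
    r1 = np.random.rand(np.sum(L)) + np.repeat(np.arange(len(L)), L)
    offset = np.r_[0, np.cumsum(L[:-1])]
    return r1.argsort()[offset]

def argmax_per_row_randtie(a):
    max_mask = a == a.max(1, keepdims=True)
    m, n = a.shape
    all_argmax_idx = np.flatnonzero(max_mask)
    offset = np.arange(m) * n
    return all_argmax_idx[random_num_per_grp_cumsumed(max_mask.sum(1))] - offset

def argmax_per_col_randtie(a):
    # Hypothetical helper: the columns of `a` are the rows of `a.T`
    return argmax_per_row_randtie(a.T)

b = np.array([[1, 3, 3], [4, 8, 6], [7, 8, 1]])
col_idx = argmax_per_col_randtie(b)
print(col_idx)  # column 1 is tied between rows 1 and 2
```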

Approach #2 : Use masking for performance

We could make use of masking and hence avoid that flatnonzero, with the intention of gaining performance, since working with boolean arrays is generally efficient. Also, we would generalize to cover both rows (axis=1) and columns (axis=0) to give ourselves a modified version, like so -

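def argmax_randtie_masking_generic(a, axis=1):
    # Mask of all positions that hold the maximum along the given axis
    max_mask = a==a.max(axis=axis,keepdims=True)
    # Number of tied maxima per row (axis=1) or per column (axis=0)
    L = max_mask.sum(axis=axis)
    # Flag exactly one randomly chosen slot per group among all maxima
    set_mask = np.zeros(L.sum(), dtype=bool)
    select_idx = random_num_per_grp_cumsumed(L)
    set_mask[select_idx] = True
    # Write the flags back so that only the chosen maxima stay True
    if axis==0:
        max_mask.T[max_mask.T] = set_mask
    else:
        max_mask[max_mask] = set_mask
    return max_mask.argmax(axis=axis)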

Sample runs on axis=0 and axis=1 -

In [423]: a
Out[423]: 
array([[1, 3, 3],
       [4, 5, 6],
       [7, 8, 1]])
In [424]: argmax_randtie_masking_generic(a, axis=1)
Out[424]: array([1, 2, 1])

In [425]: argmax_randtie_masking_generic(a, axis=1)
Out[425]: array([2, 2, 1])

In [426]: a[1,1] = 8

In [427]: a
Out[427]: 
array([[1, 3, 3],
       [4, 8, 6],
       [7, 8, 1]])

In [428]: argmax_randtie_masking_generic(a, axis=0)
Out[428]: array([2, 1, 1])

In [429]: argmax_randtie_masking_generic(a, axis=0)
Out[429]: array([2, 1, 1])

In [430]: argmax_randtie_masking_generic(a, axis=0)
Out[430]: array([2, 2, 1])
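
To confirm that outputs like the ones above really point at maxima, a small checker can help. The sketch below is an illustrative helper (picks_are_maxima is not part of the original answer) and assumes NumPy 1.15 or later for np.take_along_axis:

```python
import numpy as np

def picks_are_maxima(a, idx, axis):
    # Gather the values at the chosen indices and compare with the true maxima
    chosen = np.take_along_axis(a, np.expand_dims(idx, axis), axis)
    return bool((chosen == a.max(axis=axis, keepdims=True)).all())

a = np.array([[1, 3, 3], [4, 8, 6], [7, 8, 1]])
print(picks_are_maxima(a, np.array([2, 1, 1]), axis=0))  # True
print(picks_are_maxima(a, np.array([2, 2, 1]), axis=0))  # True
print(picks_are_maxima(a, np.array([0, 1, 1]), axis=0))  # False
```

The first two index arrays are the two distinct outputs shown in the sample runs for axis=0; the third is a deliberately wrong pick.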

Answer 2:

A simple way, provided the values are integers, is to add a small random number (less than 1) to all the values at the start, so your data would look like this:

a = np.array([[1.1827,3.1734,3.9187],[4.8172,5.7101,6.9182],[7.1834,8.5012,1.9818]])

That can be done by a = a + np.random.random(a.shape).

If you later need to get the original values back, you can do np.floor(a).astype(int) to drop the fractional parts (for non-negative values, a.astype(int) gives the same result).
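
A minimal sketch of this idea that leaves the original array untouched might look as follows. The function name argmax_jitter is an assumption for illustration, and the approach relies on the data being integer-valued so that noise below 1 cannot reorder distinct values:

```python
import numpy as np

def argmax_jitter(a, axis):
    # Assumes integer-valued data: noise in [0, 1) cannot reorder distinct integers
    noisy = a + np.random.random(a.shape)
    return noisy.argmax(axis)

a = np.array([[1, 3, 3], [4, 5, 6], [7, 8, 1]])
idx = argmax_jitter(a, 1)
print(idx)  # row 0 gives 1 or 2; rows 1 and 2 always give 2 and 1
```

For floating-point data whose distinct values can differ by less than 1, this jitter could promote a non-maximum, so one of the mask-based approaches would be the safer choice there.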

Answer 3:

You could use an array of random numbers, the same shape as your input, but mask out the array to only leave the candidates for selection.

import numpy as np

def rndArgMax(a, axis):
    a_max = a.max(axis, keepdims=True)
    tmp = np.random.random(a.shape) * (a == a_max)
    return tmp.argmax(axis)

a = np.random.randint(0, 3, size=(2, 3, 4))
print(rndArgMax(a, 1))
# Example output (varies between runs):
# [[1 1 2 1]
#  [0 1 1 1]]
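
For reproducible experiments, the same masking idea can be written against a caller-supplied random generator. This is a sketch under the assumption that NumPy 1.17 or later is available (for np.random.default_rng); the name rnd_argmax_seeded and the run count are illustrative:

```python
import numpy as np

def rnd_argmax_seeded(a, axis, rng):
    # Same masking idea, but the randomness comes from a caller-supplied Generator
    a_max = a.max(axis, keepdims=True)
    tmp = rng.random(a.shape) * (a == a_max)
    return tmp.argmax(axis)

rng = np.random.default_rng(0)
a = np.array([[1, 3, 3], [4, 5, 6], [7, 8, 1]])
runs = np.array([rnd_argmax_seeded(a, 1, rng) for _ in range(5000)])

# Fraction of runs in which the tie in row 0 was resolved to column 1
share_of_col1 = (runs[:, 0] == 1).mean()
print(share_of_col1)  # close to 0.5
```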

Source: https://stackoverflow.com/questions/51914697/numpy-arrays-row-column-wise-argmax-with-random-ties

Tags: python, arrays, numpy, random, argmax