问题
I have a an array that is relatively sparse, and I would like to go through each row and shuffle only the non-zero elements.
Example Input:
[2,3,1,0]
[0,0,2,1]
Example Output:
[2,1,3,0]
[0,0,1,2]
Note how the zeros have not changed position.
To shuffle all elements in each row (including zeros) I can do this:
for i in range(len(X)):
np.random.shuffle(X[i, :])
What I tried to do then is this:
for i in range(len(X)):
np.random.shuffle(X[i, np.nonzero(X[i, :])])
But it has no effect. I've noticed that the return type of X[i, np.nonzero(X[i, :])]
is different from X[i, :]
which might be the
cause.
In[30]: X[i, np.nonzero(X[i, :])]
Out[30]: array([[23, 5, 29, 11, 17]])
In[31]: X[i, :]
Out[31]: array([23, 5, 29, 11, 17])
回答1:
You could use the non-inplace numpy.random.permutation with explicit non-zero indexing:
>>> X = np.array([[2,3,1,0], [0,0,2,1]])
>>> for i in range(len(X)):
... idx = np.nonzero(X[i])
... X[i][idx] = np.random.permutation(X[i][idx])
...
>>> X
array([[3, 2, 1, 0],
[0, 0, 2, 1]])
回答2:
I think I found the three-liner?
i, j = np.nonzero(a.astype(bool))
k = np.argsort(i + np.random.rand(i.size))
a[i,j] = a[i,j[k]]
回答3:
As promised, this being the fourth day of the bounty period, here's my attempt at a vectorized solution. The steps involved are explained in some details below :
For easy reference, let's call the input array as
a
. Generate unique indices per row that covers the range for row length. For this, we can simply generate random numbers of the same shape as the input array and get theargsort
indices along each row, which would be those unique indices. This idea has been explored before in this post.Index into each row of input array with those indices as columns indices. Thus, we would need advanced-indexing here. Now, this gives us an array with each row being shuffled. Let's call it
b
.Since the shuffling is restricted to per row, if we simply use the boolean-indexing :
b[b!=0]
, we would get the non-zero elements being shuffled and also being restricted to lengths of non-zeros per row. This is because of the fact that the elements in a NumPy array are stored in row-major order, so with boolean-indexing it would have selected shuffled non-zero elements on each row first before moving onto the next row. Again, if we use boolean-indexing similarly fora
, i.e.a[a!=0]
, we would have similarly gotten the non-zero elements on each row first before moving onto the next row and these would be in their original order. So, the final step would be to just grab masked elementsb[b!=0]
and assign into the masked placesa[a!=0]
.
Thus, an implementation covering the above mentioned three steps would be -
m,n = a.shape
rand_idx = np.random.rand(m,n).argsort(axis=1) #step1
b = a[np.arange(m)[:,None], rand_idx] #step2
a[a!=0] = b[b!=0] #step3
A sample step-by-step run might make things clearer -
In [50]: a # Input array
Out[50]:
array([[ 8, 5, 0, -4],
[ 0, 6, 0, 3],
[ 8, 5, 0, -4]])
In [51]: m,n = a.shape # Store shape information
# Unique indices per row that covers the range for row length
In [52]: rand_idx = np.random.rand(m,n).argsort(axis=1)
In [53]: rand_idx
Out[53]:
array([[0, 2, 3, 1],
[1, 0, 3, 2],
[2, 3, 0, 1]])
# Get corresponding indexed array
In [54]: b = a[np.arange(m)[:,None], rand_idx]
# Do a check on the shuffling being restricted to per row
In [55]: a[a!=0]
Out[55]: array([ 8, 5, -4, 6, 3, 8, 5, -4])
In [56]: b[b!=0]
Out[56]: array([ 8, -4, 5, 6, 3, -4, 8, 5])
# Finally do the assignment based on masking on a and b
In [57]: a[a!=0] = b[b!=0]
In [58]: a # Final verification on desired result
Out[58]:
array([[ 8, -4, 0, 5],
[ 0, 6, 0, 3],
[-4, 8, 0, 5]])
回答4:
Benchmarking for the vectorized solutions
We are looking to benchmark vectorized solutions in this post. Now, the vectorization tries to avoid the looping that we would loop through each row and do the shuffling. So, the setup for the input array involves a greater number of rows.
Approaches -
def app1(a): # @Daniel F's soln
i, j = np.nonzero(a.astype(bool))
k = np.argsort(i + np.random.rand(i.size))
a[i,j] = a[i,j[k]]
return a
def app2(x): # @kazemakase's soln
r, c = np.where(x != 0)
n = c.size
perm = np.random.permutation(n)
i = np.argsort(perm + r * n)
x[r, c] = x[r, c[i]]
return x
def app3(a): # @Divakar's soln
m,n = a.shape
rand_idx = np.random.rand(m,n).argsort(axis=1)
b = a[np.arange(m)[:,None], rand_idx]
a[a!=0] = b[b!=0]
return a
from scipy.ndimage.measurements import labeled_comprehension
def app4(a): # @FabienP's soln
def func(array, idx):
r[idx] = np.random.permutation(array)
return True
labels, idx = nz = a.nonzero()
r = a[nz]
labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1),\
func, int, 0, pass_positions=True)
a[nz] = r
return a
Benchmarking procedure #1
For a fair benchmarking, it seemed reasonable to use OP's sample and simply stack those as more rows to get a bigger dataset. Thus, with that setup we could create two cases with 2 million and 20 million rows datasets.
Case #1 : Large dataset with 2*1000,000
rows
In [174]: a = np.array([[2,3,1,0],[0,0,2,1]])
In [175]: a = np.vstack([a]*1000000)
In [176]: %timeit app1(a)
...: %timeit app2(a)
...: %timeit app3(a)
...: %timeit app4(a)
...:
1 loop, best of 3: 264 ms per loop
1 loop, best of 3: 422 ms per loop
1 loop, best of 3: 254 ms per loop
1 loop, best of 3: 14.3 s per loop
Case #2 : Larger dataset with 2*10,000,000
rows
In [177]: a = np.array([[2,3,1,0],[0,0,2,1]])
In [178]: a = np.vstack([a]*10000000)
# app4 skipped here as it was slower on the previous smaller dataset
In [179]: %timeit app1(a)
...: %timeit app2(a)
...: %timeit app3(a)
...:
1 loop, best of 3: 2.86 s per loop
1 loop, best of 3: 4.62 s per loop
1 loop, best of 3: 2.55 s per loop
Benchmarking procedure #2 : Extensive one
To cover all cases of varying percentage of non-zeros in the input array, we are covering some extensive benchmarking scenarios. Also, since app4
seemed much slower than others, for a closer inspection we are skipping that one in this section.
Helper function to setup input array :
def in_data(n_col, nnz_ratio):
# max no. of elems that my system can handle, i.e. stretching it to limits.
# The idea is to use this to decide the number of rows and always use
# max. possible dataset size
num_elem = 10000000
n_row = num_elem//n_col
a = np.zeros((n_row, n_col),dtype=int)
L = int(round(a.size*nnz_ratio))
a.ravel()[np.random.choice(a.size, L, replace=0)] = np.random.randint(1,6,L)
return a
Main timing and plotting script (Uses IPython magic functions. So, needs to be run opon copying and pasting onto IPython console) -
import matplotlib.pyplot as plt
# Setup input params
nnz_ratios = np.array([0.2, 0.4, 0.6, 0.8])
n_cols = np.array([4, 5, 8, 10, 15, 20, 25, 50])
init_arr1 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr2 = np.zeros((len(nnz_ratios), len(n_cols) ))
init_arr3 = np.zeros((len(nnz_ratios), len(n_cols) ))
timings = {app1:init_arr1, app2:init_arr2, app3:init_arr3}
for i,nnz_ratio in enumerate(nnz_ratios):
for j,n_col in enumerate(n_cols):
a = in_data(n_col, nnz_ratio=nnz_ratio)
for func in timings:
res = %timeit -oq func(a)
timings[func][i,j] = res.best
print func.__name__, i, j, res.best
fig = plt.figure(1)
colors = ['b','k','r']
for i in range(len(nnz_ratios)):
ax = plt.subplot(2,2,i+1)
for f,func in enumerate(timings):
ax.plot(n_cols,
[time for time in timings[func][i]],
label=str(func.__name__), color=colors[f])
ax.set_xlabel('No. of cols')
ax.set_ylabel('time [seconds]')
ax.grid(which='both')
ax.legend()
plt.tight_layout()
plt.title('Percentage non-zeros : '+str(int(100*nnz_ratios[i])) + '%')
plt.subplots_adjust(wspace=0.2, hspace=0.2)
Timings output -
Observations :
Approaches #1, #2 does
argsort
on the non-zero elements across the entire input array. As such, it performs better with lesser percentage of non-zeros.Approach #3 creates random numbers of the same shape as the input array and then gets
argsort
indices per row. Thus, for a given number of non-zeros in the input, the timings for it are more steep-ish than first two approaches.
Conclusion :
Approach #1 seems to be doing pretty well until 60% non-zero mark. For more non-zeros and if the row-lengths are small, approach #3 seems to be performing decently.
回答5:
I came up with that:
nz = a.nonzero() # Get nonzero indexes
a[nz] = np.random.permutation(a[nz]) # Shuffle nonzero values with mask
Which look simpler (and a little bit faster?) than other proposed solutions.
EDIT: new version that does not mix rows
labels, *idx = nz = a.nonzero() # get masks
a[nz] = np.concatenate([np.random.permutation(a[nz][labels == i]) # permute values
for i in np.unique(labels)]) # for each label
Where the first array of a.nonzero()
(indexes of non zero values in axis0) is used as labels. This is the trick that does not mix rows.
Then np.random.permutation
is applied on a[a.nonzero()]
for each "label" independently.
Supposedly scipy.ndimage.measurements.labeled_comprehension can be used here, by it seems to fail with np.random.permutation
.
And I finally saw that it looks a lot like what @randomir proposed. Anyway, it was just for the challenge of getting it to work.
EDIT2:
Finally got it working with scipy.ndimage.measurements.labeled_comprehension
def shuffle_rows(a):
def func(array, idx):
r[idx] = np.random.permutation(array)
return True
labels, *idx = nz = a.nonzero()
r = a[nz]
labeled_comprehension(a[nz], labels + 1, np.unique(labels + 1), func, int, 0, pass_positions=True)
a[nz] = r
return a
Where:
func()
shuffles the non zero valueslabeled_comprehension
appliesfunc()
label-wise
This replaces the previous for loop and will be faster on arrays with many rows.
回答6:
This is one possibility for a vectorized solution:
r, c = np.where(x > 0)
n = c.size
perm = np.random.permutation(n)
i = np.argsort(perm + r * n)
x[r, c] = x[r, c[i]]
The challenge in vectorizing this problem is that np.random.permutation
gives only flat indices, which would shuffle the array elements across rows. Sorting the permuted values with an offset added makes sure no shuffling across rows occurs.
回答7:
Here's your two liner without needing to install numpy.
from random import random
def shuffle_nonzeros(input_list):
''' returns a list with the non-zero values shuffled '''
shuffled_nonzero = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
print([i for i in (i if i==0 else next(shuffled_nonzero) for i in input_list)])
if you dont like one liners though, you can either make this a generator with
def shuffle_nonzeros(input_list):
''' generator that yields a list with the non-zero values shuffled '''
random_nonzero_values = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
for i in iterable:
if i==0:
yield i
else:
yield next(random_nonzero_values)
or if you want a list as the output, and dont like one line comprehensions
def shuffle_nonzeros(input_list):
''' returns a list with the non-zero values shuffled '''
out = []
random_nonzero_values = iter(sorted((i for i in input_list if i!=0), key=lambda k: random()))
for i in iterable:
if i==0:
out.append(i)
else:
out.append(next(random_nonzero_values))
return out
来源:https://stackoverflow.com/questions/45764955/shuffling-non-zero-elements-of-each-row-in-an-array-python-numpy