Why is random.sample faster than numpy's random.choice?


I need a way to sample without replacement from a certain array a. I tried two approaches (see MCVE below), using random.sample() and np.random.ch


                      


      
      
        
1 Answer

        
                         				            
            
           
            
                              
                
              
              
                
                  一个人的身影        
                
              
                            
                2021-01-17 16:12
              
            
            
                                                                       
TL;DR: Use a numpy.random.default_rng() object instead of the numpy.random module-level functions. In particular, call numpy.random.default_rng().choice(...).

As mentioned in the comments, NumPy had a long-standing issue: np.random.choice was inefficient for k << n compared with the standard library's random.sample.
The problem was that np.random.choice(arr, size=k, replace=False) is implemented as permutation(arr)[:k]. For a huge array and a small k, computing a permutation of the whole array wastes time and memory. Python's standard random.sample works more directly: it samples without replacement, keeping track either of what has already been sampled or of what is still left to sample from.
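To make the difference concrete, here is a rough pure-Python sketch of the two strategies. The helper names sample_by_permutation and sample_by_tracking are made up for illustration; this is not the actual NumPy or CPython source, only the general idea described above.

```python
import random


def sample_by_permutation(population, k, rng=random):
    # Hypothetical helper: shuffle a full copy, then keep the first k items.
    # Costs O(n) time and memory no matter how small k is.
    shuffled = list(population)
    rng.shuffle(shuffled)
    return shuffled[:k]


def sample_by_tracking(population, k, rng=random):
    # Hypothetical helper: draw random indices and skip those already seen.
    # Does roughly O(k) work when k is much smaller than n.
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= len(population)")
    seen = set()
    result = []
    while len(result) < k:
        j = rng.randrange(n)
        if j not in seen:
            seen.add(j)
            result.append(population[j])
    return result


population = list(range(10_000))
print(len(sample_by_permutation(population, 100)))
print(len(sample_by_tracking(population, 100)))
```

The tracking variant slows down as k approaches n, because more and more draws hit indices that were already taken. That is one reason a library might switch strategies depending on k.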
In v1.17.0, NumPy reworked and improved the numpy.random package (see the numpy.random documentation, including its what's-new and performance pages). As the documentation says, for backward compatibility the old numpy.random API keeps the old implementation, i.e. nothing changed for the old API.
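As a small illustration of the two APIs living side by side, the sketch below draws the same kind of sample through each. It assumes NumPy 1.17 or newer; the seed value 0 and the variable names are arbitrary choices for the example.

```python
import numpy as np

population = np.arange(10_000)

# Legacy API: a RandomState object, the same machinery behind np.random.choice.
legacy = np.random.RandomState(0)
old_sample = legacy.choice(population, size=100, replace=False)

# New API: a Generator object created by default_rng().
rng = np.random.default_rng(0)
new_sample = rng.choice(population, size=100, replace=False)

# Both return 100 distinct elements; the values differ because the
# underlying bit generators and algorithms differ.
print(len(np.unique(old_sample)), len(np.unique(new_sample)))
```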
The simplest way to use the new API is to call methods on a numpy.random.default_rng() object instead of on numpy.random. In your case, that is np.random.default_rng().choice(...). First, it uses a different generator, which is probably a bit faster in most cases. Second, and more relevant to your case, choice became smarter: it uses the whole-array permutation only when the array is sufficiently large (>10000 elements) and k is relatively large (>1/50 of the size). In the other cases it uses Floyd's sampling algorithm.
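For readers unfamiliar with Floyd's sampling algorithm, a minimal pure-Python sketch follows. The function name floyd_sample is hypothetical and this is not NumPy's implementation, only the textbook form of the algorithm: it makes exactly k random draws and never has to retry.

```python
import random


def floyd_sample(n, k, rng=random):
    """Return a set of k distinct integers drawn from range(n)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    chosen = set()
    for j in range(n - k, n):
        t = rng.randint(0, j)  # randint is inclusive on both ends
        # If t was already picked, take j instead; j itself cannot be in
        # the set yet, because every earlier draw was at most j - 1.
        chosen.add(j if t in chosen else t)
    return chosen


indices = floyd_sample(10_000, 100)
print(len(indices))
```

The result is a set, so it carries no random ordering. If the order of the sampled items matters, shuffle the result afterwards.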

Here's the performance comparison on my laptop (elapsed times in seconds):
10000 times 100 samples from array of 10000 elements:
random.sample elapsed: 0.8711776689742692
np.random.choice elapsed: 1.9704092079773545
np.random.default_rng().choice elapsed: 0.818919860990718

10000 times 1000 samples from array of 10000 elements:
random.sample elapsed: 8.785315042012371
np.random.choice elapsed: 1.9777243090211414
np.random.default_rng().choice elapsed: 1.05490942299366

10000 times 10000 samples from array of 10000 elements:
random.sample elapsed: 80.15063399000792
np.random.choice elapsed: 2.0218082449864596
np.random.default_rng().choice elapsed: 2.8596064270241186

And the code I used:
import numpy as np
import random
from timeit import default_timer as timer
from contextlib import contextmanager


@contextmanager
def timeblock(label):
    start = timer()
    try:
        yield
    finally:
        end = timer()
        print('{} elapsed: {}'.format(label, end - start))


def f1(a, n_sample):
    return random.sample(range(len(a)), n_sample)


def f2(a, n_sample):
    return np.random.choice(len(a), n_sample, replace=False)


def f3(a, n_sample):
    return np.random.default_rng().choice(len(a), n_sample, replace=False)


# Generate random array
a = np.random.uniform(1., 100., 10000)
# Number of indexes to randomly sample from a
n_sample = 100
# Number of times to repeat each of f1, f2 and f3
# (10000 matches the results reported above)
N = 10000

print(f'{N} times {n_sample} samples from array of {len(a)} elements:')
with timeblock("random.sample"):
    for _ in range(N):
        f1(a, n_sample)

with timeblock("np.random.choice"):
    for _ in range(N):
        f2(a, n_sample)

with timeblock("np.random.default_rng().choice"):
    for _ in range(N):
        f3(a, n_sample)
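Note that f3 above builds a fresh generator on every call, so its timing includes the cost of constructing it. A variant that creates the generator once and reuses it might look like the sketch below. The names f4 and rng are hypothetical additions, not part of the benchmark above, and no timing is claimed for them.

```python
import numpy as np

# Created once and reused across calls; with no seed given, the generator
# is initialised from fresh OS entropy.
rng = np.random.default_rng()


def f4(a, n_sample):
    # Same sampling as f3, but without constructing a Generator per call.
    return rng.choice(len(a), n_sample, replace=False)


a = np.random.uniform(1., 100., 10000)
idx = f4(a, 100)
print(idx.shape)
```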

                                                                        
                                                        
            
            
              
                