searching sorted items into a sorted sequence

前端 未结 1 988
悲哀的现实
悲哀的现实 2021-01-20 05:14

I want to find a sequence of items in a sorted array of values. I know that with numpy I can do:

l = np.searchsorted(values, items)

This ha

相关标签:
1条回答
  • 2021-01-20 05:27

    Some example data:

    # some example data
    np.random.seed(0)
    n_values = 1000000
    n_items = 100000
    values = np.random.rand(n_values)
    items = np.random.rand(n_items)
    values.sort()
    items.sort()
    

    Your original code snippet as well as an implementation of @PeterE's suggestion:

    def original(values, items):
        l = np.empty(items.size, dtype=np.int32)
        k, K = 0, len(values)
        for i, item in enumerate(items):
            while k < K and values[k] < item:
                k += 1
            l[i] = k
        return l
    
    def peter_e(values, items):
        l = np.empty(items.size, dtype=np.int32)
        last_idx = 0
        for i, item in enumerate(items):
            last_idx += values[last_idx:].searchsorted(item)
            l[i] = last_idx
        return l
    

    Test for correctness against naive np.searchsorted:

    ss = values.searchsorted(items)
    
    print(all(original(values, items) == ss))
    # True
    
    print(all(peter_e(values, items) == ss))
    # True
    

    Timings:

    In [1]: %timeit original(values, items)
    10 loops, best of 3: 115 ms per loop
    
    In [2]: %timeit peter_e(values, items)
    10 loops, best of 3: 79.8 ms per loop
    
    In [3]: %timeit values.searchsorted(items)
    100 loops, best of 3: 4.09 ms per loop
    

    So for inputs of this size, naive use of np.searchsorted handily beats your original code, as well as PeterE's suggestion.

    Update

    To avoid any caching effects that might skew the timings, we can generate a new set of random input arrays for each iteration of the benchmark:

    In [1]: %%timeit values = np.random.randn(n_values); items = np.random.randn(n_items); values.sort(); items.sort();
    original(values, items)
       .....: 
    10 loops, best of 3: 115 ms per loop
    
    In [2]: %%timeit values = np.random.randn(n_values); items = np.random.randn(n_items); values.sort(); items.sort();
    peter_e(values, items)
       .....: 
    10 loops, best of 3: 79.9 ms per loop
    
    In [3]: %%timeit values = np.random.randn(n_values); items = np.random.randn(n_items); values.sort(); items.sort();
    values.searchsorted(items)
       .....: 
    100 loops, best of 3: 4.08 ms per loop
    

    Update 2

    It's not that hard to write a Cython function that will beat np.searchsorted for the case where both values and items are sorted.

    search_doubly_sorted.pyx:

    import numpy as np
    cimport numpy as np
    cimport cython
    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    def search_doubly_sorted(values, items):
    
        cdef:
            double[:] _values = values.astype(np.double)
            double[:] _items = items.astype(np.double)
            long n_items = items.shape[0]
            long n_values = values.shape[0]
            long[:] out = np.empty(n_items, dtype=np.int64)
            long ii, jj, last_idx
    
        last_idx = 0
        for ii in range(n_items):
            for jj in range(last_idx, n_values):
                 if _items[ii] <= _values[jj]:
                    break
            last_idx = jj
            out[ii] = last_idx
    
        return out.base
    

    Test for correctness:

    In [1]: from search_doubly_sorted import search_doubly_sorted
    
    In [2]: print(all(search_doubly_sorted(values, items) == values.searchsorted(items)))                     
    # True
    

    Benchmark:

    In [3]: %timeit values.searchsorted(items)
    100 loops, best of 3: 4.07 ms per loop
    
    In [4]: %timeit search_doubly_sorted(values, items)
    1000 loops, best of 3: 1.44 ms per loop
    

    The performance improvement is fairly marginal, though. Unless this is a serious bottleneck in your code then you should probably stick with np.searchsorted.

    0 讨论(0)
提交回复
热议问题