Converting NumPy array to a set takes too long

后端 未结 2 1334
别那么骄傲
别那么骄傲 2020-12-06 01:56

I\'m trying to execute the following

from numpy import *
x = array([[3,2,3],[711,4,104],.........,[4,4,782,7845]])  # large nparray
for item in x:
    set(it         


        
相关标签:
2条回答
  • 2020-12-06 02:10

    TL;DR: The set() function creates a set using Pythons iteration protocol. But iterating (on the Python level) over NumPy arrays is so slow that using tolist() to convert the array to a Python list before doing the iteration is (much) faster.

    To understand why iterating over NumPy arrays is so slow it's important to know how Python objects, Python lists, and NumPy arrays are stored in memory.

    A Python object needs some bookkeeping properties (like the reference count, a link to its class, ...) and the value it represents. For example the integer ten = 10 could look like this:

    The blue circle is the "name" you use in the Python interpreter for the variable ten and the lower object (instance) is what actually represents the integer (since the bookkeeping properties aren't imporant here I ignored them in the images).

    A Python list is just a collection of Python objects, for example mylist = [1, 2, 3] would be saved like this:

    This time the list references the Python integers 1, 2 and 3 and the name mylist just references the list instance.

    But an array myarray = np.array([1, 2, 3]) doesn't store Python objects as elements:

    The values 1, 2 and 3 are stored directly in the NumPy array instance.


    With this information I can explain why iterating over an array is so much slower compared to an iteration over a list:

    Each time you access the next element in a list the list just returns a stored object. That's very fast because the element already exists as Python object (it just needs to increment the reference count by one).

    On the other hand when you want an element of an array it needs to create a new Python "box" for the value with all the bookkeeping stuff before it is returned. When you iterate over the array it needs to create one Python box for each element in your array:

    Creating these boxes is slow and the main reason why iterating over NumPy arrays is much slower than iterating over Python collections (lists/tuples/sets/dictionaries) which store the values and their box:

    import numpy as np
    arr = np.arange(100000)
    lst = list(range(100000))
    
    def iterateover(obj):
        for item in obj:
            pass
    
    %timeit iterateover(arr)
    # 20.2 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit iterateover(lst)
    # 3.96 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    The set "constructor" just does an iteration over the object.

    One thing I can't answer definitely is why the tolist method is so much faster. In the end each value in the resulting Python list needs to be in a "Python box" so there's not much work that tolist could avoid. But one thing I know for sure is that list(array) is slower than array.tolist():

    arr = np.arange(100000)
    
    %timeit list(arr)
    # 20 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit arr.tolist()
    # 10.3 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Each of these has O(n) runtime complexity but the constant factors are very different.

    In your case you did compare set() to tolist() - which isn't a particular good comparison. It would make more sense to compare set(arr) to list(arr) or set(arr.tolist()) to arr.tolist():

    arr = np.random.randint(0, 1000, (10000, 3))
    
    def tosets(arr):
        for line in arr:
            set(line)
    
    def tolists(arr):
        for line in arr:
            list(line)
    
    def tolists_method(arr):
        for line in arr:
            line.tolist()
    
    def tosets_intermediatelist(arr):
        for line in arr:
            set(line.tolist())
    
    %timeit tosets(arr)
    # 72.2 ms ± 2.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit tolists(arr)
    # 80.5 ms ± 2.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    %timeit tolists_method(arr)
    # 16.3 ms ± 140 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    %timeit tosets_intermediatelist(arr)
    # 38.5 ms ± 200 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    So if you want sets you are better off using set(arr.tolist()). For bigger arrays it could make sense to use np.unique but because your rows only contain 3 items that will likely be slower (for thousands of elements it could be much faster!).


    In the comments you asked about numba and yes, it's true that numba could speed this up. Numba supports typed sets (only numeric types), but that doesn't mean it will be always faster.

    I'm not sure how numba (re-)implements sets but because they are typed it's likely they also avoid the "Python boxes" and store the values directly inside the set:

    Sets are more complicated than lists because it they involve hashes and empty slots (Python uses open-addressing for sets, so I assume numba will too).

    Like the NumPy array the numba set saves the values directly. So when you convert a NumPy array to a numba set (or vise-versa) it won't need to use "Python boxes" at all, so when you create the sets in a numba nopython function it will be much faster even than the set(arr.tolist()) operation:

    import numba as nb
    @nb.njit
    def tosets_numba(arr):
        for lineno in range(arr.shape[0]):
            set(arr[lineno])
    
    tosets_numba(arr)  # warmup
    %timeit tosets_numba(arr)
    # 6.55 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    That's roughly five times faster than the set(arr.tolist()) approach. But it's important to highlight that I did not return the sets from the function. When you return a set from a nopython numba function to Python Numba creates a python set - including "creating the boxes" for all values in the set (that's something numba is hiding).

    Just FYI: The same boxing/unboxing happens if you pass lists to Numba nopython functions or return lists from these functions. So what's a O(1) operation in Python is an O(n) operation with Numba! That's why it's generally better to pass NumPy arrays to numba nopython function (which is O(1)).

    I assume that if you return these sets from the function (not really possible right now because numba doesn't support lists of sets currently) it would be slower (because it creates a numba set and later converts it to a python set) or only marginally faster (if the conversion numbaset -> pythonset is really, really fast).

    Personally I would use numba for sets only if I don't need to return them from the function and do all operations on the set inside the function and only if all the operations on the set are supported in nopython mode. In any other case I wouldn't use numba here.


    Just a note: from numpy import * should be avoided, you hide several python built-in functions when you do that (sum, min, max, ...) and it puts a lot of stuff into your globals. Better to use import numpy as np. The np. in front of function calls makes the code clearer and isn't much to type.

    0 讨论(0)
  • 2020-12-06 02:12

    Here is a way to speed things up: avoid the loop and use a multiprocessing pool.map trick

    from multiprocessing.dummy import Pool as ThreadPool
    import multiprocessing
    
    pool = ThreadPool(multiprocessing.cpu_count()) # get the number of CPU
    y = pool.map(set,x) # apply the function to your iterable
    pool.close()
    pool.join()
    
    0 讨论(0)
提交回复
热议问题