Fastest way to convert a list of indices to 2D numpy array of ones

星月不相逢 2021-01-05 15:11

I have a list of indices

a = [
  [1,2,4],
  [0,2,3],
  [1,3,4],
  [0,2]]

What's the fastest way to convert this to a 2D numpy array of ones, where each sublist gives the column indices of the ones in the corresponding row?
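
That is, the desired output for this example (which all the answers below produce) is:

array([[0, 1, 1, 0, 1],
       [1, 0, 1, 1, 0],
       [0, 1, 0, 1, 1],
       [1, 0, 1, 0, 0]])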

6 Answers
  • 2021-01-05 15:30

    In case you can and want to use Cython, you can create a readable (at least if you don't mind the typing) and fast solution.

    Here I'm using the IPython bindings of Cython to compile it in a Jupyter notebook:

    %load_ext cython
    
    %%cython
    
    cimport cython
    cimport numpy as cnp
    import numpy as np
    
    @cython.boundscheck(False)  # remove this if you cannot guarantee that nrow/ncol are correct
    @cython.wraparound(False)
    cpdef cnp.int_t[:, :] mseifert(list a, int nrow, int ncol):
        cdef cnp.int_t[:, :] out = np.zeros([nrow, ncol], dtype=int)
        cdef list subl
        cdef int row_idx
        cdef int col_idx
        for row_idx, subl in enumerate(a):
            for col_idx in subl:
                out[row_idx, col_idx] = 1
        return out
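
    A quick sanity check on the question's example, assuming the cell above compiled (mseifert returns a typed memoryview, so np.asarray is used here to view it as an ndarray):

    a = [[1,2,4],[0,2,3],[1,3,4],[0,2]]
    np.asarray(mseifert(a, 4, 5))
    # array([[0, 1, 1, 0, 1],
    #        [1, 0, 1, 1, 0],
    #        [0, 1, 0, 1, 1],
    #        [1, 0, 1, 0, 0]])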
    

    To compare the performance of the solutions presented here, I use my library simple_benchmark:

    The benchmark plot (not reproduced here) uses logarithmic axes to show the differences for small and large arrays simultaneously. According to my benchmark, my function is actually the fastest of the solutions, though none of the others is too far behind.

    Here is the complete code I used for the benchmark:

    import numpy as np
    from simple_benchmark import BenchmarkBuilder, MultiArgument
    import itertools
    
    b = BenchmarkBuilder()
    
    @b.add_function()
    def pp(a, nrow, ncol):
        sz = np.fromiter(map(len, a), int, nrow)
        out = np.zeros((nrow, ncol), int)
        out[np.arange(nrow).repeat(sz), np.fromiter(itertools.chain.from_iterable(a), int, sz.sum())] = 1
        return out
    
    @b.add_function()
    def ts(a, nrow, ncol):
        out = np.zeros((nrow, ncol), int)
        for i, ix in enumerate(a):
            out[i][ix] = 1
        return out
    
    @b.add_function()
    def u9(a, nrow, ncol):
        out = np.zeros((nrow, ncol), int)
        for i, (x, y) in enumerate(zip(a, out)):
            y[x] = 1
            out[i] = y
        return out
    
    b.add_functions([mseifert])
    
    @b.add_arguments("number of rows/columns")
    def argument_provider():
        for n in range(2, 13):
            ncols = 2**n
            a = [
                sorted(set(np.random.randint(0, ncols, size=np.random.randint(0, ncols)))) 
                for _ in range(ncols)
            ]
            yield ncols, MultiArgument([a, ncols, ncols])
    
    r = b.run()
    r.plot()
    
  • 2021-01-05 15:40

    How about this:

    ncol = 5
    nrow = len(a)
    out = np.zeros((nrow, ncol), int)
    out[np.arange(nrow).repeat([*map(len,a)]), np.concatenate(a)] = 1
    out
    # array([[0, 1, 1, 0, 1],
    #        [1, 0, 1, 1, 0],
    #        [0, 1, 0, 1, 1],
    #        [1, 0, 1, 0, 0]])
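
    The trick is to build the row and column coordinates of every 1 up front and assign them all with a single fancy-indexing operation. For the question's input, the two coordinate arrays look like this:

    np.arange(nrow).repeat([*map(len,a)])
    # array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3])
    np.concatenate(a)
    # array([1, 2, 4, 0, 2, 3, 1, 3, 4, 0, 2])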
    

    Here are timings for a 1000x1000 binary array. Note that I use an optimized version of the above; see the function pp below:

    pp 21.717635259992676 ms
    ts 37.10938713003998 ms
    u9 37.32933565042913 ms
    

    Code to produce timings:

    import itertools as it
    import numpy as np
    
    def make_data(n,m):
        I,J = np.where(np.random.random((n,m))<np.random.random((n,1)))
        return [*map(np.ndarray.tolist, np.split(J, I.searchsorted(np.arange(1,n))))]
    
    def pp():
        sz = np.fromiter(map(len,a),int,nrow)
        out = np.zeros((nrow,ncol),int)
        out[np.arange(nrow).repeat(sz),np.fromiter(it.chain.from_iterable(a),int,sz.sum())] = 1
        return out
    
    def ts():
        out = np.zeros((nrow,ncol),int)
        for i, ix in enumerate(a):
            out[i][ix] = 1
        return out
    
    def u9():
        out = np.zeros((nrow,ncol),int)
        for i, (x, y) in enumerate(zip(a, out)):
            y[x] = 1
            out[i] = y
        return out
    
    nrow,ncol = 1000,1000
    a = make_data(nrow,ncol)
    
    from timeit import timeit
    assert (pp()==ts()).all()
    assert (pp()==u9()).all()
    
    print("pp", timeit(pp,number=100)*10, "ms")
    print("ts", timeit(ts,number=100)*10, "ms")
    print("u9", timeit(u9,number=100)*10, "ms")
    
  • 2021-01-05 15:46

    Depending on your use case, you might look into using sparse matrices. The input looks suspiciously like the index structure of a Compressed Sparse Row (CSR) matrix. Perhaps something like this:

    import numpy as np
    from scipy.sparse import csr_matrix
    from itertools import accumulate
    
    
    def ragged2csr(inds):
        # indptr must start at 0, end at the total number of nonzeros,
        # and hold one running total per row in between
        lens = [len(x) for x in inds]
        indptr = np.array([0, *accumulate(lens)])
        indices = np.array([val for sublist in inds for val in sublist])
        data = np.ones(indices.size)
        # pass shape=(nrow, ncol) explicitly if trailing rows/columns may be empty
        return csr_matrix((data, indices, indptr))
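
    A quick check on the question's example (densifying only for verification; that defeats the purpose for large inputs):

    a = [[1,2,4],[0,2,3],[1,3,4],[0,2]]
    m = ragged2csr(a)
    m.toarray()
    # array([[0., 1., 1., 0., 1.],
    #        [1., 0., 1., 1., 0.],
    #        [0., 1., 0., 1., 1.],
    #        [1., 0., 1., 0., 0.]])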
    
    

    Again, if it fits your use case, a sparse matrix lets elementwise/masking operations scale with the number of nonzeros rather than the number of elements (rows*columns), which can bring a significant speedup for a sparse enough matrix.

    Another good introduction to CSR matrices is section 3.4 of Iterative Methods. In that notation, data is aa, indices is ja and indptr is ia. The format also has the benefit of being very popular among different packages/libraries.
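
    For the question's example, those three arrays would be:

    ia (indptr)  = [0, 3, 6, 9, 11]
    ja (indices) = [1, 2, 4, 0, 2, 3, 1, 3, 4, 0, 2]
    aa (data)    = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]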

  • 2021-01-05 15:49

    This may not be the best way, but it's the only way I can think of:

    output = np.zeros((4,5))
    for i, (x, y) in enumerate(zip(a, output)):
        y[x] = 1
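        # y is a view into output, so the line above already writes into output;
        # the assignment below copies the row onto itself (redundant but harmless)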
        output[i] = y
    print(output)
    

    Which outputs:

    [[ 0.  1.  1.  0.  1.]
     [ 1.  0.  1.  1.  0.]
     [ 0.  1.  0.  1.  1.]
     [ 1.  0.  1.  0.  0.]]
    
  • 2021-01-05 15:51

    This might not be the fastest way; you would need to compare the execution times of these answers on large arrays to find that out. Here's my solution:

    output = np.zeros((4,5), int)
    for i, ix in enumerate(a):
        output[i][ix] = 1

    # output ->
    # array([[0, 1, 1, 0, 1],
    #        [1, 0, 1, 1, 0],
    #        [0, 1, 0, 1, 1],
    #        [1, 0, 1, 0, 0]])
    
  • 2021-01-05 15:51

    How about using flat (linear) array indexing? If you knew more about your input, you could avoid the cost of first converting the indices to linear form.

    import numpy as np
    
    
    def main():
        row_count = 4
        col_count = 5
        a = [[1,2,4],[0,2,3],[1,3,4],[0,2]]
    
        # iterate through each row, concatenate all indices, and convert them
        # to linear (flat) indices into the row_count x col_count array

        # numpy's append copies the array on every call; appending to a list is faster
        b = []
        for row_idx, row in enumerate(a):
            b.append(np.array(row, dtype=np.int64) + (row_idx * col_count))
    
        linear_idxs = np.hstack(b)
        # the previous steps could be skipped if the indices were supplied in linear form beforehand
        c = np.zeros(row_count * col_count)
        c[linear_idxs] = 1
        c = c.reshape(row_count, col_count)
        print(c)
    
    
    if __name__ == "__main__":
        main()
    
    # output
    # [[0. 1. 1. 0. 1.]
    #  [1. 0. 1. 1. 0.]
    #  [0. 1. 0. 1. 1.]
    #  [1. 0. 1. 0. 0.]]
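
    The per-row loop above can also be vectorized by combining the repeat/concatenate idea from the other answers with the same linear-index arithmetic (a sketch, reusing row_count, col_count and a from main):

    linear_idxs = np.concatenate(a) + np.arange(row_count).repeat([*map(len, a)]) * col_count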
    