How to get a list of all indices of repeated elements in a numpy array?

Asked by 你的背包, 2020-12-03 03:37

I'm trying to get the indices of all repeated elements in a numpy array, but the solution I've found so far is REALLY inefficient for large inputs (>20000 elements).

6 Answers
  • 2020-12-03 04:11

    You could do something along the lines of:

    1. pair each value with its original index, e.g. [[1,0],[2,1],[3,2],[1,3],[1,4]...
    2. sort the pairs on [:,0]
    3. use np.where(ra[1:,0] != ra[:-1,0]) to find the break points between runs of equal values
    4. use that list of indices to construct your final list of lists (see the sketch below)
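
    A minimal sketch of those four steps, assuming the small test array used elsewhere on this page (the variable names are illustrative, not from the original answer):

    import numpy as np

    records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

    # 1. pair each value with its original index
    ra = np.array([[v, i] for i, v in enumerate(records_array)])
    # 2. sort the pairs on the value column; kind='stable' keeps equal values in original order
    ra = ra[ra[:, 0].argsort(kind='stable')]
    # 3. break points: positions where the sorted value changes
    breaks = np.where(ra[1:, 0] != ra[:-1, 0])[0] + 1
    # 4. split the original-index column at the break points
    groups = np.split(ra[:, 1], breaks)
    # groups -> [array([0, 3, 4]), array([1, 8]), array([2, 5, 7]), array([6])]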
    

    EDIT - OK, so after my quick reply I was away for a while, and I see I've been voted down, which is fair enough, as numpy.argsort() is a much better approach than my suggestion. I did vote up the numpy.unique() answer, as it is an interesting feature. However, if you use timeit you will find that

    # idx_sort and sorted_records_array are defined in the np.unique() answer below
    idx_start = np.where(sorted_records_array[:-1] != sorted_records_array[1:])[0] + 1
    res = np.split(idx_sort, idx_start)
    

    is marginally faster than

    vals, idx_start = np.unique(sorted_records_array, return_index=True)
    res = np.split(idx_sort, idx_start[1:])
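
    For reference, a minimal way to reproduce the timing comparison with the standard-library timeit module (the random test data here is an assumption, not from the original answer):

    import timeit
    import numpy as np

    # hypothetical test data; any large array with repeats will do
    rng = np.random.default_rng(0)
    records_array = rng.integers(0, 100, 20_000)
    idx_sort = np.argsort(records_array)
    sorted_records_array = records_array[idx_sort]

    def via_where():
        idx_start = np.where(sorted_records_array[:-1] != sorted_records_array[1:])[0] + 1
        return np.split(idx_sort, idx_start)

    def via_unique():
        vals, idx_start = np.unique(sorted_records_array, return_index=True)
        return np.split(idx_sort, idx_start[1:])

    print(timeit.timeit(via_where, number=100))
    print(timeit.timeit(via_unique, number=100))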
    

    Further edit, following a question by @Nicolas:

    I'm not sure you can. It is possible to get two arrays of indices corresponding to the break points, but you can't break different 'rows' of the array up into different-sized pieces using np.split, so

    a = np.array([[4,27,42,12, 4 .. 240, 12], [3,65,23...] etc])
    idx = np.argsort(a, axis=1)
    sorted_a = np.diagonal(a[:, idx[:]]).T  # np.take_along_axis(a, idx, axis=1) is equivalent and cheaper
    idx_start = np.where(sorted_a[:,:-1] != sorted_a[:,1:])
    
    # idx_start => (array([0,0,0,..1,1,..]), array([1,4,6,7..99,0,4,5]))
    

    but that might be good enough depending on what you want to do with the information.
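
    If a plain Python loop over rows is acceptable, though, the 1-D recipe can be applied to each row separately (a sketch reusing a, idx, and sorted_a from above):

    groups_per_row = [
        np.split(idx[r], np.where(sorted_a[r, :-1] != sorted_a[r, 1:])[0] + 1)
        for r in range(a.shape[0])
    ]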

  • 2020-12-03 04:20

    A vectorized solution with numpy, built on the magic of np.unique().

    import numpy as np

    # create a test array
    records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])

    # creates an array of indices, sorted by unique element
    idx_sort = np.argsort(records_array)

    # sorts records array so all unique elements are together
    sorted_records_array = records_array[idx_sort]

    # returns the unique values, the index of the first occurrence of a value, and the count for each element
    vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)

    # splits the indices into separate arrays
    res = np.split(idx_sort, idx_start[1:])

    # filter them with respect to their size, keeping only items occurring more than once
    # (list() is needed in Python 3, where filter returns an iterator)
    vals = vals[count > 1]
    res = list(filter(lambda x: x.size > 1, res))
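
    For the test array above, this leaves vals = array([1, 2, 3]) and res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])] (the value 4 occurs only once and is dropped). Note that the order within each group depends on the tie-breaking of np.argsort; pass kind='stable' if the original order of equal elements matters.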
    

    The following code was the original answer, which required a bit more memory, using numpy broadcasting and calling unique twice:

    import numpy as np

    records_array = np.array([1, 2, 3, 1, 1, 3, 4, 3, 2])
    vals, inverse, count = np.unique(records_array, return_inverse=True,
                                     return_counts=True)

    # values that occur more than once
    idx_vals_repeated = np.where(count > 1)[0]
    vals_repeated = vals[idx_vals_repeated]

    # broadcast to find, for each repeated value, the positions where it occurs
    rows, cols = np.where(inverse == idx_vals_repeated[:, np.newaxis])
    _, inverse_rows = np.unique(rows, return_index=True)
    res = np.split(cols, inverse_rows[1:])
    

    which gives, as expected, res = [array([0, 3, 4]), array([1, 8]), array([2, 5, 7])].

  • 2020-12-03 04:24

    @gg349's solution packaged up into a function:

    import numpy as np

    def better_np_unique(arr):
        sort_indexes = np.argsort(arr)
        arr = np.asarray(arr)[sort_indexes]
        vals, first_indexes, inverse, counts = np.unique(arr,
            return_index=True, return_inverse=True, return_counts=True)
        indexes = np.split(sort_indexes, first_indexes[1:])
        # sort each group so the original indices come out in ascending order
        for x in indexes:
            x.sort()
        return vals, indexes, inverse, counts
    

    It's essentially the same as np.unique but returns all indices, not just the first indices.
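
    A small usage sketch, on the sample array used in the other answers:

    vals, indexes, inverse, counts = better_np_unique([1, 2, 3, 1, 1, 3, 4, 3, 2])
    for v, idx, c in zip(vals, indexes, counts):
        if c > 1:  # keep only the repeated values
            print(v, idx)
    # 1 [0 3 4]
    # 2 [1 8]
    # 3 [2 5 7]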

  • 2020-12-03 04:25
    • The answer is complicated, and highly dependent upon the size of the array and the number of unique elements.
    • The following tests arrays with 2M elements and up to 20k unique elements,
    • and tests arrays with up to 80k elements and a max of 20k unique elements.
      • For arrays under 40k elements, the tests use up to half as many unique elements as the size of the array (e.g. an array of 10k elements would have up to 5k unique elements).

    Arrays with 2M Elements

    • np.where is faster than defaultdict for up to about 200 unique elements, but slower than pandas.core.groupby.GroupBy.indices and np.unique.
    • The solution using pandas is the fastest for large arrays.

    Arrays with up to 80k Elements

    • This is more situational, depending on the size of the array and the number of unique elements.
    • defaultdict is a fast option for arrays up to about 2400 elements, especially with a large number of unique elements.
    • For arrays larger than 40k elements with 20k unique elements, pandas is the fastest option.

    %timeit

    # note: the %timeit magics below require running under IPython/Jupyter
    import random
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from collections import defaultdict
    
    def dd(l):
        # default_dict test
        indices = defaultdict(list)
        for i, v in enumerate(l):
            indices[v].append(i)
        return indices
    
    
    def npw(l):
        # np_where test
        return {v: np.where(l == v)[0] for v in np.unique(l)}
    
    
    def uni(records_array):
        # np_unique test
        idx_sort = np.argsort(records_array)
        sorted_records_array = records_array[idx_sort]
        vals, idx_start, count = np.unique(sorted_records_array, return_counts=True, return_index=True)
        res = np.split(idx_sort, idx_start[1:])
        return dict(zip(vals, res))
    
    
    def daf(l):
        # pandas test
        return pd.DataFrame(l).groupby([0]).indices
    
    
    data = defaultdict(list)
    
    for x in range(4, 20000, 100):  # number of unique elements
        # create 2M element list
        random.seed(365)
        a = np.array([random.choice(range(x)) for _ in range(2000000)])
        
        res1 = %timeit -r2 -n1 -q -o dd(a)
        res2 = %timeit -r2 -n1 -q -o npw(a)
        res3 = %timeit -r2 -n1 -q -o uni(a)
        res4 = %timeit -r2 -n1 -q -o daf(a)
        
    data['default_dict'].append(res1.average)
        data['np_where'].append(res2.average)
        data['np_unique'].append(res3.average)
        data['pandas'].append(res4.average)
        data['idx'].append(x)
    
    df = pd.DataFrame(data)
    df.set_index('idx', inplace=True)
    
    df.plot(figsize=(12, 5), xlabel='unique samples', ylabel='average time (s)', title='%timeit test: 2 run 1 loop each')
    plt.legend(bbox_to_anchor=(1.0, 1), loc='upper left')
    plt.show()
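
    For reference, the GroupBy.indices attribute used in daf() returns a dict mapping each value to the array of positions where it occurs, e.g. (illustrative output for the small test array used elsewhere on this page):

    daf(np.array([1, 2, 3, 1, 1, 3, 4, 3, 2]))
    # {1: array([0, 3, 4]), 2: array([1, 8]), 3: array([2, 5, 7]), 4: array([6])}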
    

    [plot: tests with arrays of 2M elements]

    [plot: tests with arrays of up to 80k elements]

  • 2020-12-03 04:28

    You can also do this:

    import numpy as np

    a = [1, 2, 3, 1, 1, 3, 4, 3, 2]
    index_sets = [np.argwhere(i == a) for i in np.unique(a)]
    

    this will give you a list of arrays, each containing the indices of one unique element:

    [array([[0],[3],[4]], dtype=int64), 
    array([[1],[8]], dtype=int64), 
    array([[2],[5],[7]], dtype=int64), 
    array([[6]], dtype=int64)]
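
    Note that np.argwhere returns column vectors; if flat index arrays are preferred, np.flatnonzero is a drop-in variant (a sketch on the same data):

    index_sets = [np.flatnonzero(i == a) for i in np.unique(a)]
    # [array([0, 3, 4]), array([1, 8]), array([2, 5, 7]), array([6])]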
    

    Added: A further change to the list comprehension can also discard values that occur only once, addressing the speed concern when there are many unique, single-occurrence elements:

    new_index_sets = [np.argwhere(i[0]== a) for i in np.array(np.unique(a, return_counts=True)).T if i[1]>=2]
    

    this gives:

    [array([[0],[3],[4]], dtype=int64), 
     array([[1],[8]], dtype=int64), 
     array([[2],[5],[7]], dtype=int64)]
    
  • 2020-12-03 04:31

    So I was unable to get rid of the for loop entirely, but I was able to pare down its use with the set data type and the list.count() method:

    data = [1, 2, 3, 1, 4, 5, 2, 2]
    indivs = set(data)

    multi_index = lambda lst, val: [i for i, x in enumerate(lst) if x == val]

    if len(data) != len(indivs):  # there is at least one duplicate
        dupes = [multi_index(data, i) for i in indivs if data.count(i) > 1]
    

    You loop over your indivs set, which contains the values with no duplicates, and then scan the full list for each value that has a duplicate. I am looking into a numpy alternative if this isn't fast enough for you; generator objects might also speed this up if need be.
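
    Note that data.count(i) rescans the full list for every unique value, so the above is quadratic in the worst case; a single-pass alternative (essentially the defaultdict approach benchmarked in another answer here) avoids that:

    from collections import defaultdict

    indices = defaultdict(list)
    for i, v in enumerate(data):
        indices[v].append(i)
    dupes = [idx for idx in indices.values() if len(idx) > 1]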

    Edit: gg349's answer holds the numpy solution I was working on!
