Grouping indices of unique elements in numpy

伪装坚强ぢ 2021-01-17 17:15

I have many large (>100,000,000) lists of integers that contain many duplicates. I want to get the indices where each element occurs. Currently I am doing something li

5 answers
  • 2021-01-17 17:38

    This is very similar to what was asked here, so what follows is an adaptation of my answer there. The simplest way to vectorize this is to use sorting. The following code borrows a lot from the implementation of np.unique for the upcoming version 1.9, which includes unique item counting functionality, see here:

    >>> import numpy as np
    >>> a = np.array([1, 2, 6, 4, 2, 3, 2])
    >>> sort_idx = np.argsort(a)
    >>> a_sorted = a[sort_idx]
    >>> unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
    >>> unq_items = a_sorted[unq_first]
    >>> unq_count = np.diff(np.nonzero(unq_first)[0])
    

    and now:

    >>> unq_items
    array([1, 2, 3, 4, 6])
    >>> unq_count
    array([1, 3, 1, 1], dtype=int64)
    

    To get the positional indices for each value, we simply do:

    >>> unq_idx = np.split(sort_idx, np.cumsum(unq_count))
    >>> unq_idx
    [array([0], dtype=int64), array([1, 4, 6], dtype=int64), array([5], dtype=int64),
     array([3], dtype=int64), array([2], dtype=int64)]
    

    And you can now construct your dictionary zipping unq_items and unq_idx.
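
    A minimal sketch of that last step, wrapped into one function. The helper name `group_indices` is hypothetical, `kind="stable"` is an added choice so that the indices inside each group come out in ascending order, and a non-empty 1-D input is assumed:

```python
import numpy as np

def group_indices(a):
    """Map each unique value of a non-empty 1-D array to the positions where it occurs."""
    a = np.asarray(a)
    # A stable sort keeps equal elements in their original relative order.
    sort_idx = np.argsort(a, kind="stable")
    a_sorted = a[sort_idx]
    # True at the first element of each run of equal values.
    unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))
    unq_items = a_sorted[unq_first]
    # The start offsets of every run except the first are the split points.
    split_points = np.nonzero(unq_first)[0][1:]
    unq_idx = np.split(sort_idx, split_points)
    return dict(zip(unq_items.tolist(), unq_idx))

groups = group_indices([1, 2, 6, 4, 2, 3, 2])
```

    Here `groups[2]` holds the positions 1, 4 and 6, matching the second entry of `unq_idx` above.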

    Note that unq_count doesn't count the occurrences of the last unique item, because that is not needed to split the index array. If you wanted to have all the values you could do:

    >>> unq_count = np.diff(np.concatenate(np.nonzero(unq_first) + ([a.size],)))
    >>> unq_idx = np.split(sort_idx, np.cumsum(unq_count[:-1]))
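
    As a sanity check, the complete counts should agree with what `np.unique(..., return_counts=True)` reports. A small self-contained sketch on the same example array (the name `boundaries` is only illustrative):

```python
import numpy as np

a = np.array([1, 2, 6, 4, 2, 3, 2])
sort_idx = np.argsort(a, kind="stable")
a_sorted = a[sort_idx]
unq_first = np.concatenate(([True], a_sorted[1:] != a_sorted[:-1]))

# Appending a.size as a final boundary makes np.diff count the last run too.
boundaries = np.concatenate((np.nonzero(unq_first)[0], [a.size]))
unq_count = np.diff(boundaries)

# Cross-check against NumPy's built-in counting.
values, counts = np.unique(a, return_counts=True)
print(np.array_equal(unq_count, counts))
```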
    
  • 2021-01-17 17:54
    def to_components(index):
        # [:-1] drops the final boundary so np.split returns no trailing empty array
        return np.split(np.argsort(index), np.cumsum(np.unique(index, return_counts=True)[1])[:-1])
    
  • 2021-01-17 17:55

    This can be solved with pandas (the Python data analysis library) and a DataFrame.groupby call.

    Consider the following:

    import numpy as np
    import pandas as pd

    a = np.array([1, 2, 6, 4, 2, 3, 2])
    df = pd.DataFrame({'a': a})

    gg = df.groupby(by=df.a)
    gg.groups
    

    Output:

    {1: [0], 2: [1, 4, 6], 3: [5], 4: [3], 6: [2]}
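
    If NumPy arrays of row positions are preferred over index labels, a GroupBy object also exposes an `indices` attribute. A small sketch on the same example (the variable name `idx_by_value` is illustrative; with the default RangeIndex, labels and positions coincide):

```python
import numpy as np
import pandas as pd

a = np.array([1, 2, 6, 4, 2, 3, 2])
df = pd.DataFrame({"a": a})

# GroupBy.indices maps each group key to a NumPy array of row positions.
idx_by_value = df.groupby("a").indices
```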
    
  • 2021-01-17 17:56

    The numpy_indexed package (disclaimer: I am its author) implements a solution inspired by Jaime's, but with tests, a nice interface, and a lot of related functionality:

    import numpy as np
    import numpy_indexed as npi
    unique, idx_groups = npi.group_by(a, np.arange(len(a)))
    
  • 2021-01-17 17:58

    Simple and quick solution.

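    import numpy as np
    a = np.array([0, 0, 0, 1, 1, 3, 3, 3, 2, 2, 2, 0, 0, 1, 4])
    sort_idx = np.argsort(a)
    unique, counts = np.unique(a, return_counts=True)
    # End offset of each group in sort_idx; slicing by offset (not by key) works for any values
    ends = np.cumsum(counts)
    b = {key: sort_idx[end - n:end] for key, end, n in zip(unique, ends, counts)}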
    