Determine the shuffled indices of two lists/arrays

野的像风 2021-01-18 16:03

As a challenge, I've given myself this problem:

Given 2 lists, A, and B, where B is a shuffled version of A, the idea is to figure out the shuffled indices.
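To make the goal concrete, a minimal sketch (the quadratic `list.index` baseline here is only for illustration; the answers below improve on it):

```python
A = [10, 40, 30, 2]
B = [30, 2, 10, 40]

# The task: find res such that picking A's elements at those
# indices reproduces B, i.e. [A[i] for i in res] == B.
res = [A.index(x) for x in B]  # naive O(n^2) baseline
print(res)  # [2, 3, 0, 1]
assert [A[i] for i in res] == B
```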

6 Answers
  • 2021-01-18 16:23

    LOL

    pd.Series(A).reset_index().set_index(0).loc[B].T.values[0]  # .ix was removed in pandas 1.0; .loc is the modern equivalent
    #array([2, 3, 0, 1])
    
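Since `.ix` has been removed from pandas, a sketch of an equivalent one-liner using `Index.get_indexer` (it requires the values of A to be unique):

```python
import pandas as pd

A = [10, 40, 30, 2]
B = [30, 2, 10, 40]

# get_indexer returns, for each element of B, its position in A.
result = pd.Index(A).get_indexer(B)
print(result)  # [2 3 0 1]
```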
  • 2021-01-18 16:23

    The numpy_indexed package has an efficient and general solution to this:

    import numpy_indexed as npi
    result = npi.indices(A, B)
    

    Note that it has a kwarg to control how missing values are handled, and it works just as well on nd-arrays of any dtype as on 1d integer arrays.

  • 2021-01-18 16:24

    As mentioned in my question, I was able to solve this using a dictionary. I store the indices in a dict and then use a list comprehension to pop them out:

    A = [10, 40, 30, 2]
    B = [30, 2, 10, 40]
    
    idx = {}
    for i, l in enumerate(A):
        idx.setdefault(l, []).append(i)
    
    res = [idx[l].pop() for l in B]
    print(res)
    

    Output:

    [2, 3, 0, 1]
    

    This is better than the obvious [A.index(x) for x in B] because it is

    1. linear
    2. handles duplicates gracefully
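To illustrate point 2 with a hypothetical input containing a repeated value: each occurrence in B pops a distinct index from the stored list, so duplicates are matched one-to-one:

```python
A = [10, 40, 10, 2]
B = [10, 2, 10, 40]

idx = {}
for i, l in enumerate(A):
    idx.setdefault(l, []).append(i)

# Each duplicate in B consumes a different stored index of A.
res = [idx[l].pop() for l in B]
print(res)  # [2, 3, 0, 1]
assert [A[i] for i in res] == B
```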
  • 2021-01-18 16:26

    We can make use of np.searchsorted with its optional sorter argument (assuming numpy is imported as np) -

    sidx = np.argsort(B)
    out = sidx[np.searchsorted(B,A, sorter=sidx)]
    

    Sample run -

    In [19]: A = [10, 40, 30, 2, 40]
        ...: B = [30, 2, 10, 40]
        ...: 
    
    In [20]: sidx = np.argsort(B)
    
    In [21]: sidx[np.searchsorted(B,A, sorter=sidx)]
    Out[21]: array([2, 3, 0, 1, 3])
    
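A self-contained version of the snippet above (note two caveats visible in the sample run: every element of A must occur in B, and repeated values in A all map to the same position of B, unlike the dict-based answer):

```python
import numpy as np

A = [10, 40, 30, 2, 40]
B = [30, 2, 10, 40]

# argsort gives the permutation that sorts B; searchsorted locates
# each element of A within the sorted B, and sidx maps those sorted
# positions back to B's original indices.
sidx = np.argsort(B)
out = sidx[np.searchsorted(B, A, sorter=sidx)]
print(out)  # [2 3 0 1 3]
```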
  • 2021-01-18 16:31

    As an improvement over your current solution, you could use collections.defaultdict and avoid dict.setdefault:

    from collections import defaultdict
    
    A = [10, 40, 30, 2]
    B = [30, 2, 10, 40]
    
    idx = defaultdict(list)
    for i, l in enumerate(A):
        idx[l].append(i)
    
    res = [idx[l].pop() for l in B]
    print(res)
    

    Here are the timings for the two methods using the sample input given:

    Script used for testing

    from timeit import timeit
    
    
    setup = """
    from collections import defaultdict;
    idx1 = defaultdict(list); idx2 = {}
    A = [10, 40, 30, 2]
    B = [30, 2, 10, 40]
    """
    
    me = """
    for i, l in enumerate(A):
        idx1[l].append(i)
    res = [idx1[l].pop() for l in B]
    """
    
    coldspeed = """
    for i, l in enumerate(A):
        idx2.setdefault(l, []).append(i)
    res = [idx2[l].pop() for l in B]
    """
    
    print('modified:', timeit(setup=setup, stmt=me))
    print('original:', timeit(setup=setup, stmt=coldspeed))
    

    Results

    original: 2.601998388010543
    modified: 2.0607256239745766
    

    So it appears that using defaultdict actually yields a slight speed increase. This makes sense, though, since defaultdict is implemented in C rather than Python. Not to mention that the attribute lookup in the original solution - idx.setdefault - is costly.

  • 2021-01-18 16:41

    Since several very nice solutions were posted, I've taken the liberty of assembling some crude timings to compare each method.

    Script used for testing

    from timeit import timeit
    
    
    setup = """
    from collections import defaultdict
    import pandas as pd 
    import numpy as np 
    idx1 = defaultdict(list); idx2 = {}
    A = [10, 40, 30, 2]
    B = [30, 2, 10, 40]
    """
    
    me = """
    for i, l in enumerate(A):
        idx1[l].append(i)
    res = [idx1[l].pop() for l in B]
    """
    
    coldspeed = """
    for i, l in enumerate(A):
        idx2.setdefault(l, []).append(i)
    res = [idx2[l].pop() for l in B]
    """
    
    divakar = """
    sidx = np.argsort(B)
    res = sidx[np.searchsorted(B,A, sorter=sidx)]
    """
    
    dyz = """
    res = pd.Series(A).reset_index().set_index(0).loc[B].T.values[0]
    """
    
    print('mine:', timeit(setup=setup, stmt=me, number=1000))
    print('coldspeed:', timeit(setup=setup, stmt=coldspeed, number=1000))
    print('divakar:', timeit(setup=setup, stmt=divakar, number=1000))
    print('dyz:', timeit(setup=setup, stmt=dyz, number=1000))
    

    Result/Output (run on a Jupyter notebook server, 1000 loops)

    mine: 0.0026700650341808796
    coldspeed: 0.0029303128831088543
    divakar: 0.02583012101240456
    dyz: 2.208147854078561
    

    Here are some timings where A contains 100,000 random numbers and B is its shuffled equivalent. At that size the benchmark was too time- and memory-consuming, so I had to reduce the number of loops to 100. Otherwise, everything is the same as above:

    mine: 17.663535300991498
    coldspeed: 17.11006522300886
    divakar: 8.73397267702967
    dyz: 44.61878849985078
    