As a challenge, I've given myself this problem:
Given two lists A and B, where B is a shuffled version of A, figure out the shuffled indices.
LOL
import pandas as pd

# .ix has been removed from modern pandas; .loc does the same label lookup here
pd.Series(A).reset_index().set_index(0).loc[B].T.values[0]
# array([2, 3, 0, 1])
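On current pandas, `pd.Index.get_indexer` expresses the same lookup more directly, without the reset_index/set_index round-trip (this assumes the values in A are unique, which get_indexer requires):

```python
import pandas as pd

A = [10, 40, 30, 2]
B = [30, 2, 10, 40]

# For each element of B, find its position in A.
result = pd.Index(A).get_indexer(B)  # array([2, 3, 0, 1])
```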
The numpy_indexed package has an efficient and general solution to this:
import numpy_indexed as npi
result = npi.indices(A, B)
Note that it has a kwarg to set a mode for dealing with missing values, and it works just the same on nd-arrays of any dtype as it does on 1d integer arrays.
As mentioned in my question, I was able to solve this using a dictionary. I store the indices in a dict
and then use a list comprehension to pop them out:
A = [10, 40, 30, 2]
B = [30, 2, 10, 40]
idx = {}
for i, l in enumerate(A):
    idx.setdefault(l, []).append(i)
res = [idx[l].pop() for l in B]
print(res)
Output:
[2, 3, 0, 1]
This is better than the obvious [A.index(x) for x in B] because list.index scans the list on every call (quadratic overall) and always returns the first match, so it reuses the same index when A contains duplicates; the dict approach is linear and hands out each index exactly once.
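A quick sketch of the duplicate-handling difference, using an extra 40 as an illustrative duplicate (not part of the original sample input):

```python
A = [10, 40, 30, 2, 40]
B = [40, 2, 10, 40, 30]

# dict/pop approach: each occurrence of 40 is matched to its own index
idx = {}
for i, l in enumerate(A):
    idx.setdefault(l, []).append(i)
res = [idx[l].pop() for l in B]
assert [A[i] for i in res] == B      # every index is used exactly once
assert len(set(res)) == len(res)

# naive approach: list.index returns the first 40 both times
naive = [A.index(x) for x in B]
assert len(set(naive)) < len(naive)  # index 1 is reused for both 40s
```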
We can make use of np.searchsorted with its optional sorter
argument -
import numpy as np

sidx = np.argsort(B)
out = sidx[np.searchsorted(B, A, sorter=sidx)]
Sample run -
In [19]: A = [10, 40, 30, 2, 40]
...: B = [30, 2, 10, 40]
...:
In [20]: sidx = np.argsort(B)
In [21]: sidx[np.searchsorted(B,A, sorter=sidx)]
Out[21]: array([2, 3, 0, 1, 3])
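One caveat worth noting: this maps each element of A to an index into B, which is the opposite direction from the dict-based answers (those map B into A); the outputs coincide on the sample input only because that particular permutation is its own inverse. A self-contained check of the direction:

```python
import numpy as np

A = [10, 40, 30, 2, 40]
B = [30, 2, 10, 40]

sidx = np.argsort(B)
out = sidx[np.searchsorted(B, A, sorter=sidx)]  # array([2, 3, 0, 1, 3])

# out[i] is a position of A[i] within B, so indexing B with out rebuilds A
assert (np.array(B)[out] == np.array(A)).all()
```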
As an improvement over your current solution, you could use collections.defaultdict and avoid dict.setdefault:
from collections import defaultdict
A = [10, 40, 30, 2]
B = [30, 2, 10, 40]
idx = defaultdict(list)
for i, l in enumerate(A):
    idx[l].append(i)
res = [idx[l].pop() for l in B]
print(res)
Here are the timings for the two methods using the sample input given:
Script used for testing
from timeit import timeit
setup = """
from collections import defaultdict;
idx1 = defaultdict(list); idx2 = {}
A = [10, 40, 30, 2]
B = [30, 2, 10, 40]
"""
me = """
for i, l in enumerate(A):
    idx1[l].append(i)
res = [idx1[l].pop() for l in B]
"""
coldspeed = """
for i, l in enumerate(A):
    idx2.setdefault(l, []).append(i)
res = [idx2[l].pop() for l in B]
"""
print(timeit(setup=setup, stmt=me))
print(timeit(setup=setup, stmt=coldspeed))
Results
original: 2.601998388010543
modified: 2.0607256239745766
So it appears that using defaultdict
actually yields a slight speed increase. This makes sense, since defaultdict's
default-value machinery is implemented in C rather than Python. Not to mention that the repeated attribute lookup in the original solution - idx.setdefault
- is costly.
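If you want to keep the plain dict, the usual micro-optimisation for that attribute-lookup cost is to bind the method once before the loop (a sketch; not benchmarked here):

```python
A = [10, 40, 30, 2]
B = [30, 2, 10, 40]

idx = {}
setdefault = idx.setdefault      # hoist the attribute lookup out of the loop
for i, l in enumerate(A):
    setdefault(l, []).append(i)
res = [idx[l].pop() for l in B]  # [2, 3, 0, 1]
```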
Since several very nice solutions were posted, I've taken the liberty of assembling some crude timings to compare each method.
Script used for testing
from timeit import timeit
setup = """
from collections import defaultdict
import pandas as pd
import numpy as np
idx1 = defaultdict(list); idx2 = {}
A = [10, 40, 30, 2]
B = [30, 2, 10, 40]
"""
me = """
for i, l in enumerate(A):
    idx1[l].append(i)
res = [idx1[l].pop() for l in B]
"""
coldspeed = """
for i, l in enumerate(A):
    idx2.setdefault(l, []).append(i)
res = [idx2[l].pop() for l in B]
"""
divakar = """
sidx = np.argsort(B)
res = sidx[np.searchsorted(B,A, sorter=sidx)]
"""
dyz = """
res = pd.Series(A).reset_index().set_index(0).loc[B].T.values[0]
"""
print('mine:', timeit(setup=setup, stmt=me, number=1000))
print('coldspeed:', timeit(setup=setup, stmt=coldspeed, number=1000))
print('divakar:', timeit(setup=setup, stmt=divakar, number=1000))
print('dyz:', timeit(setup=setup, stmt=dyz, number=1000))
Result/Output (run on a Jupyter notebook server, 1000 loops)
mine: 0.0026700650341808796
coldspeed: 0.0029303128831088543
divakar: 0.02583012101240456
dyz: 2.208147854078561
Here are some timings where A
is 100,000 random numbers and B
is its shuffled equivalent. The full run was just too time- and memory-consuming, so I also had to reduce the number of loops to 100. Otherwise, everything is the same as above:
mine: 17.663535300991498
coldspeed: 17.11006522300886
divakar: 8.73397267702967
dyz: 44.61878849985078
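The exact generation code for that large input isn't shown above; it was presumably something along these lines (the use of random.sample to get distinct values is my assumption), with a sanity check that the dict method still inverts the shuffle:

```python
import random

N = 100_000
A = random.sample(range(10 * N), N)  # N distinct values, drawn at random
B = A[:]
random.shuffle(B)                    # B is a shuffled copy of A

idx = {}
for i, l in enumerate(A):
    idx.setdefault(l, []).append(i)
res = [idx[l].pop() for l in B]

# res maps each element of B back to its index in A
assert [A[i] for i in res] == B
```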