I\'m trying to slice a dataframe based on list of values, how would I go about this?
Say I have an expression or a list l = [0,1,0,0,1,1,0,0,0,1]
Ho
Convert the list to a boolean array and then use boolean indexing:
df = pd.DataFrame(np.random.randint(10, size=(10, 3)))
df[np.array(lst).astype(bool)]
Out:
0 1 2
1 8 6 3
4 2 7 3
5 7 2 3
9 1 3 4
yet another "creative" approach:
In [181]: a = np.array(lst)
In [182]: df.query("index * @a > 0")
Out[182]:
0 1 2
1 1 5 5
4 0 2 0
5 4 9 9
9 2 2 5
or much better variant from @ayhan:
In [183]: df.query("@a != 0")
Out[183]:
0 1 2
1 1 5 5
4 0 2 0
5 4 9 9
9 2 2 5
PS i've also borrowed @Ayhan's setup
Or maybe find the position of 1 in your list
and slice from the Dataframe
df.loc[[i for i,x in enumerate(lst) if x == 1],:]
Selecting using a list of Booleans is something itertools.compress does well.
Given
>>> df = pd.DataFrame(np.random.randint(10, size=(10, 2)))
>>> selectors = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
Code
>>> selected_idxs = list(itertools.compress(df.index, selectors)) # [1, 4, 5, 9]
>>> df.iloc[selected_idxs, :]
0 1
1 1 9
4 3 4
5 4 1
9 8 9
Setup
Borrowed @ayhan's setup
df = pd.DataFrame(np.random.randint(10, size=(10, 3)))
Without numpy
not the fastest, but it holds its own and is definitely the shortest.
df[list(map(bool, lst))]
0 1 2
1 3 5 6
4 6 3 2
5 5 7 6
9 0 0 1
Timing
results.div(results.min(1), 0).round(2).pipe(lambda d: d.assign(Best=d.idxmin(1)))
ayh wvo pir mxu wen Best
N
1 1.53 1.00 1.02 4.95 2.61 wvo
3 1.06 1.00 1.04 5.46 2.84 wvo
10 1.00 1.00 1.00 4.30 2.73 ayh
30 1.00 1.05 1.24 4.06 3.76 ayh
100 1.16 1.00 1.19 3.90 3.53 wvo
300 1.29 1.00 1.32 2.50 2.38 wvo
1000 1.54 1.00 2.19 2.24 3.85 wvo
3000 1.39 1.00 2.17 1.81 4.55 wvo
10000 1.22 1.00 2.21 1.35 4.36 wvo
30000 1.19 1.00 2.26 1.39 5.36 wvo
100000 1.19 1.00 2.19 1.31 4.82 wvo
fig, (a1, a2) = plt.subplots(2, 1, figsize=(6, 6))
results.plot(loglog=True, lw=3, ax=a1)
results.div(results.min(1), 0).round(2).plot.bar(logy=True, ax=a2)
fig.tight_layout()
Testing Code
ayh = lambda d, l: d[np.array(l).astype(bool)]
wvo = lambda d, l: d[np.array(l, dtype=bool)]
pir = lambda d, l: d[list(map(bool, l))]
wen = lambda d, l: d.loc[[i for i, x in enumerate(l) if x == 1], :]
def mxu(d, l):
a = np.array(l)
return d.query('@a != 0')
results = pd.DataFrame(
index=pd.Index([1, 3, 10, 30, 100, 300,
1000, 3000, 10000, 30000, 100000], name='N'),
columns='ayh wvo pir mxu wen'.split(),
dtype=float
)
for i in results.index:
d = pd.concat([df] * i, ignore_index=True)
l = lst * i
for j in results.columns:
stmt = '{}(d, l)'.format(j)
setp = 'from __main__ import d, l, {}'.format(j)
results.set_value(i, j, timeit(stmt, setp, number=10))
You can use masking here:
df[np.array([0,1,0,0,1,1,0,0,0,1],dtype=bool)]
So we construct a boolean array with true and false. Every place where the array is True is a row we select.
Mind that we do not filter inplace. In order to retrieve the result, you have to assign the result to an (optionally different) variable:
df2 = df[np.array([0,1,0,0,1,1,0,0,0,1],dtype=bool)]