I have a large dataframe that I want to take slices of (according to multiple boolean criteria), and then modify the entries in those slices in order to change the original dataframe.
Even though df.loc[idx] may be a copy of a portion of df, assignment to df.loc[idx] modifies df itself. (This is also true of df.iloc and, in pandas versions before 1.0, df.ix, which has since been removed.)
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [9, 10] * 6,
                   'B': range(23, 35),
                   'C': range(-6, 6)})
print(df)
#      A   B  C
# 0    9  23 -6
# 1   10  24 -5
# 2    9  25 -4
# 3   10  26 -3
# 4    9  27 -2
# 5   10  28 -1
# 6    9  29  0
# 7   10  30  1
# 8    9  31  2
# 9   10  32  3
# 10   9  33  4
# 11  10  34  5
Here is our boolean index:
idx = (df['C']!=0) & (df['A']==10) & (df['B']<30)
We can modify those rows of df where idx is True by assigning to df.loc[idx, ...]. For example,
df.loc[idx, 'A'] += df.loc[idx, 'B'] * df.loc[idx, 'C']
print(df)
yields
      A   B  C
0     9  23 -6
1  -110  24 -5
2     9  25 -4
3   -68  26 -3
4     9  27 -2
5   -18  28 -1
6     9  29  0
7    10  30  1
8     9  31  2
9    10  32  3
10    9  33  4
11   10  34  5
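To make the contrast concrete on a smaller frame, here is a minimal sketch (the frame and the names small, mask and subset are made up for illustration and are not part of the example above). It shows that changes to an explicit copy of the selection stay in the copy, while assignment through .loc on the original frame writes into it.

```python
import pandas as pd

# Hypothetical toy frame, not the df from the example above.
small = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Combine criteria with & and parenthesize each comparison.
mask = (small['A'] > 1) & (small['B'] < 40)   # True for rows 1 and 2

# Changes to an explicit copy of the selection do not reach `small`.
subset = small.loc[mask].copy()
subset['A'] = 0
print(small['A'].tolist())   # still [1, 2, 3, 4]

# Assigning through .loc on the original frame modifies it.
small.loc[mask, 'A'] = 0
print(small['A'].tolist())   # now [1, 0, 0, 4]
```

Taking the explicit .copy() keeps the intent unambiguous and avoids depending on whether pandas happens to return a view or a copy for a given selection.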
Building off of unutbu's example you could also use the boolean index on df.index like so:
In [11]: df.loc[df.index[idx]] = 999
In [12]: df
Out[12]:
      A    B    C
0     9   23   -6
1   999  999  999
2     9   25   -4
3   999  999  999
4     9   27   -4
5   999  999  999
6     9   29    0
7    10   30    1
8     9   31    2
9    10   32    3
10    9   33    4
11   10   34    5
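As a rough sketch of the same idea written with df.loc (useful on pandas versions where df.ix is no longer available), the snippet below checks that selecting rows by the labels taken from the index and selecting them by the boolean mask directly give the same result. The frame and the names frame, mask, labels, by_labels and by_mask are illustrative, and the equivalence assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical toy frame with string labels, so labels and positions differ.
frame = pd.DataFrame({'A': [10, 9, 10], 'B': [24, 25, 31]},
                     index=['x', 'y', 'z'])
mask = (frame['A'] == 10) & (frame['B'] < 30)   # True only for row 'x'

# Turn the boolean mask into the matching row labels.
labels = frame.index[mask]

# Assign to whole rows by label ...
by_labels = frame.copy()
by_labels.loc[labels] = 999

# ... or by passing the boolean mask straight to .loc.
by_mask = frame.copy()
by_mask.loc[mask] = 999

print(by_labels.equals(by_mask))
```

With a non-unique index, the label-based form could match more rows than the mask does, so passing the mask directly is the safer default.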
The pandas docs have a section on Returning a view versus a copy:
The rules about when a view on the data is returned are entirely dependent on NumPy. Whenever an array of labels or a boolean vector are involved in the indexing operation, the result will be a copy. With single label / scalar indexing and slicing, e.g. df.ix[3:6] or df.ix[:, 'A'], a view will be returned.
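The NumPy behavior that the quoted passage relies on can be checked directly. This is a minimal sketch on a plain NumPy array rather than a DataFrame (the names arr, mask, sliced and masked are illustrative). pandas itself may not follow the quoted rule in every version, particularly newer ones with Copy-on-Write, so treat this as an illustration of the underlying NumPy rule only.

```python
import numpy as np

arr = np.arange(12)
mask = arr % 2 == 1          # boolean vector selecting the odd values

sliced = arr[3:6]            # basic slicing: a view onto arr's memory
masked = arr[mask]           # boolean indexing: a fresh copy

print(np.shares_memory(arr, sliced))   # True
print(np.shares_memory(arr, masked))   # False

# Writing through the view changes arr; writing to the copy does not.
sliced[0] = -1
masked[0] = -2
print(arr[3], arr[1])        # -1 1
```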