I'm trying to set a number of different elements in a pandas DataFrame all to the same value. I thought I understood boolean indexing for pandas, but I haven't found any resources
pandas uses NaN to mark invalid or missing data, and NaN can be used across types. Since your DataFrame has mixed int and string data types, it will not accept the assignment of a single non-NaN value, because an in-place assignment would create a mixed type (int and str) in B.
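To illustrate the point about NaN being usable across types, here is a minimal sketch. It reuses the example DataFrame and mask from this thread and assumes a reasonably recent pandas; the name `masked` is just for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])

# With no `other` given, where() fills the non-matching cells with NaN,
# which is accepted in both the numeric column and the string column.
masked = df.where(mask)

print(masked)
print(masked.isna())
```

The cells where the mask is False come back as missing in both columns, which is the one replacement value that does not depend on the column type.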
@JohnE's method using np.where creates a new DataFrame in which the dtype of column B is object, not string as in the initial example.
Unfortunately, you can't use the boolean mask on mixed dtypes for this. You can use pandas where to set the values instead:
In [59]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df = df.where(mask, other=30)
df
Out[59]:
A B
0 1 a
1 30 30
2 3 30
Note that the above will fail if you pass inplace=True to the where method, so df.where(mask, other=30, inplace=True) will raise:
TypeError: Cannot do inplace boolean setting on mixed-types with a non np.nan value
EDIT
OK, after a little misunderstanding: you can still use where by just inverting the mask:
In [2]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])
df.where(~mask, other=30)
Out[2]:
A B
0 30 30
1 2 b
2 30 f
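As a side note, pandas also provides DataFrame.mask, which replaces values where the condition is True, so the inversion can be left to the library. A small sketch under the same example data (the names `replaced` and `same` are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])

# mask() replaces the cells where the condition is True,
# i.e. it behaves like where() with the inverted condition.
replaced = df.mask(mask, other=30)

# Compare against the inverted-where form used above.
same = replaced.equals(df.where(~mask, other=30))
print(replaced)
print(same)
```

Whether mask or where(~mask) reads better is mostly a matter of taste; both return a new DataFrame rather than modifying df.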
I'm not 100% sure, but I suspect the error message relates to the fact that missing data is not treated identically across dtypes. Only float has NaN, but integers can be automatically converted to floats, so it's not a problem there. Mixing numeric dtypes and object dtypes, however, does not appear to work so easily...
Regardless of that, you could get around it pretty easily with np.where:
df[:] = np.where(mask, 30, df)
A B
0 30 30
1 2 b
2 30 f
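For a self-contained variant of the np.where approach, the sketch below builds a new DataFrame from the resulting array instead of assigning into df[:]. That is a deliberate swap on my part, not the answer's exact code, since in-place assignment behaviour can differ between pandas versions. The names `result` and `restored` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': ['a', 'b', 'f']})
mask = df.isin([1, 3, 12, 'a'])

# np.where works on the underlying arrays and returns a plain ndarray,
# so the index and column labels have to be attached again.
result = pd.DataFrame(np.where(mask, 30, df), index=df.index, columns=df.columns)

# The array holds mixed Python objects; infer_objects() lets pandas
# pick a numeric dtype again for columns that contain only numbers.
restored = result.infer_objects()

print(result)
print(restored.dtypes)
```

Because the intermediate array mixes integers and strings, the columns of `result` are generic object columns until something like infer_objects() is applied.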
If you want to use different columns to create your mask, you need to call the values property of the dataframe.
Let's say we want to replace values in A_1 and A_2 according to a mask in B_1 and B_2. For example, replace those values in A (with 999) that correspond to nulls in B.
The original dataframe:
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 5 n NaN
2 3 6 NaN NaN
The desired dataframe:
A_1 A_2 B_1 B_2
0 1 4 y n
1 2 999 n NaN
2 999 999 NaN NaN
The code:
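import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})
# Take the raw boolean array so the mask is applied by position, not by column label
_mask = df[['B_1', 'B_2']].notnull().values
df[['A_1', 'A_2']] = df[['A_1', 'A_2']].where(_mask, other=999)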
A_1 A_2
0 1 4
1 2 999
2 999 999
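The need for .values comes from label alignment: when the condition is itself a DataFrame, where matches it to the target by index and column labels, and B_1/B_2 do not match A_1/A_2. The sketch below contrasts the two cases; the names `label_mask`, `aligned` and `fixed` are illustrative, and the behaviour of the misaligned case is as I understand it for current pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A_1': [1, 2, 3],
    'A_2': [4, 5, 6],
    'B_1': ['y', 'n', np.nan],
    'B_2': ['n', np.nan, np.nan]})

label_mask = df[['B_1', 'B_2']].notnull()

# DataFrame condition: aligned on labels. No column of the condition is
# named A_1 or A_2, so every cell is treated as False and gets replaced.
aligned = df[['A_1', 'A_2']].where(label_mask, other=999)

# ndarray condition: applied purely by position, which is what we want here.
fixed = df[['A_1', 'A_2']].where(label_mask.values, other=999)

print(aligned)
print(fixed)
```

Only the positional version reproduces the desired result, which is why the mask is converted to a plain array before being passed to where.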