In Pandas/Python, I have to write the dataframe name twice when conditioning on its own variable:
df_REPEATED[df_REPEATED['var'] > 0]
This h
df_REPEATED['var'] > 0 is a boolean Series. Other than its length and index, it has no connection to the DataFrame. It could just as well have been the result of another expression, say another_df['another_var'] > some_other_value, as long as the lengths (and, for a Series, the index labels) match. That is where the flexibility comes from: if the syntax were like the one you suggested, we couldn't do this. However, there are alternatives to what you are asking. For example,
df_REPEATED.query('var > 0')
query can be very fast if the DataFrame is large, and it is less verbose, but it lacks the flexibility of boolean indexing, and it gets awkward once the expression becomes complicated.
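To make the comparison concrete, here is a small sketch on toy data. The frames df and other and their values are made up for illustration; only the column names var and another_var come from the discussion above.

```python
import pandas as pd

# Toy data, invented for this illustration.
df = pd.DataFrame({"var": [-1, 0, 2, 5]})
other = pd.DataFrame({"another_var": [10, 20, 30, 40]})

# Mask computed from the frame being filtered.
own_mask = df["var"] > 0

# Mask computed from a different frame; it works because the length
# and the (default) index labels match those of df.
foreign_mask = other["another_var"] > 15

by_own_mask = df[own_mask]
by_foreign_mask = df[foreign_mask]

# The query form names the frame only once.
by_query = df.query("var > 0")

# Masks are ordinary boolean Series, so they combine with & and |.
combined = df[own_mask & foreign_mask]

print(by_own_mask)
print(by_foreign_mask)
print(by_query)
print(combined)
```

The foreign mask is exactly what the query string cannot express directly, which is the flexibility the repeated name buys you.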
Not an official answer... but it already made my life simpler recently:
https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py
You don't need to download the entire repo: saving the file and doing
from where import Where as W
should suffice. Then you use it like this:
import pandas as pd
df = pd.DataFrame([[1, 2, True],
                   [3, 4, False],
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])
# On specific column:
print(df.loc[W['a'] > 2])
print(df.loc[-W['a'] == W['b']])
print(df.loc[~W['c']])
# On entire DataFrame:
print(df.loc[W.sum(axis=1) > 3])
print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])
A slightly less stupid usage example:
data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]
EDIT: this answer mentions an analogous approach not requiring external components, resulting in:
data = (pd.read_csv('ugly_db.csv')
          .loc[lambda df: ~(df == '$null$').any(axis=1)])
and another possibility is to use .pipe(), as in
data = (pd.read_csv('ugly_db.csv')
          .pipe(lambda df: df[~(df == '$null$').any(axis=1)]))
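For anyone who wants to try the callable-based variants without a CSV file, here is a minimal sketch on an in-memory frame with the same values as the small example earlier in this answer. The variable names and the total column are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 3, 5],
                   "b": [2, 4, 7],
                   "c": [True, False, True]})

# A callable passed to .loc receives the frame being indexed,
# so the frame never has to be named a second time.
a_gt_2 = df.loc[lambda d: d["a"] > 2]
not_c = df.loc[lambda d: ~d["c"]]

# .pipe passes the frame to any function; to filter, the function
# has to return the filtered frame, not just the boolean mask.
via_pipe = df.pipe(lambda d: d[d["a"] > 2])

# The callable form also works mid-chain, on a frame that has no name yet.
chained = (df.assign(total=lambda d: d["a"] + d["b"])
             .loc[lambda d: d["total"] > 5])

print(a_gt_2)
print(not_c)
print(via_pipe)
print(chained)
```

The mid-chain case is where this pays off most, since there the intermediate frame has no variable name you could repeat even if you wanted to.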