Alternatives to awkward Pandas/Python DataFrame indexing: df_REPEATED[df_REPEATED['var'] > 0]?


In Pandas/Python, I have to write the dataframe name twice when conditioning on its own variable:

df_REPEATED[df_REPEATED['var'] > 0]

This h


2 Answers

北恋

2021-01-24 02:12
df_REPEATED['var'] > 0 is a boolean Series. Apart from its length and index, it has no connection to the DataFrame it was computed from. It could just as well be the result of another expression, say another_df['another_var'] > some_other_value, as long as the lengths (and, for a Series, the index labels) match. That is what makes boolean indexing flexible: if the syntax were like the one you suggest, we couldn't do this. There are, however, alternatives that do what you are asking. For example:
```
df_REPEATED.query('var > 0')
```
query can be very fast on a large DataFrame and it is less verbose, but it lacks the flexibility of boolean indexing, and it becomes awkward once the expression gets complicated.
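To make the comparison concrete, here is a minimal sketch that puts the two styles side by side. The frames, column values and the threshold variable are made up for illustration; only the pandas calls themselves (boolean indexing and DataFrame.query) come from the discussion above.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({"var": [-1, 0, 2, 5], "other": [10, 20, 30, 40]})
another_df = pd.DataFrame({"another_var": [7, 1, 9, 3]})

# Boolean indexing with a mask built from the same frame:
# the frame name appears twice.
by_mask = df[df["var"] > 0]

# The mask can also come from a different object,
# provided its index labels match those of df.
external_mask = another_df["another_var"] > 5
by_external = df[external_mask]

# query: the frame is named once and the column is resolved inside the string.
by_query = df.query("var > 0")

# Local Python variables are referenced with the @ prefix.
threshold = 0
by_query_var = df.query("var > @threshold")

print(by_mask.equals(by_query))
```

The external mask is the case query cannot express directly, which is the flexibility the answer refers to.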

灰色年华

2021-01-24 02:21

Not an official answer... but it already made my life simpler recently:

https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

You don't need to download the entire repo: saving the file and doing

from where import Where as W

should suffice. Then you use it like this:

import pandas as pd

df = pd.DataFrame([[1, 2, True],
                   [3, 4, False],
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])
# On specific column:
print(df.loc[W['a'] > 2])
print(df.loc[-W['a'] == W['b']])
print(df.loc[~W['c']])
# On entire DataFrame:
print(df.loc[W.sum(axis=1) > 3])
print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])

A slightly less stupid usage example:

data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]

EDIT: this answer mentions an analogous approach not requiring external components, resulting in:

data = (pd.read_csv('ugly_db.csv')
          .loc[lambda df : ~(df == '$null$').any(axis=1)])

and another possibility is to use .pipe(), as in

data = (pd.read_csv('ugly_db.csv')
          .pipe(lambda df : df[~(df == '$null$').any(axis=1)]))

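The chained forms can be tried without a CSV file. The sketch below uses a small in-memory frame in place of ugly_db.csv; the data and the helper name drop_null_rows are assumptions made for illustration. Note that a callable passed to .loc returns the boolean mask, whereas a function passed to .pipe has to return the filtered frame itself.

```python
import pandas as pd

# Hypothetical stand-in for the contents of 'ugly_db.csv'.
raw = pd.DataFrame({"name": ["a", "$null$", "c"],
                    "value": ["1", "2", "$null$"]})


def drop_null_rows(frame):
    """Return only the rows in which no cell equals the '$null$' marker."""
    return frame[~(frame == "$null$").any(axis=1)]


# .loc accepts a callable: it receives the frame and returns the mask.
via_loc = raw.loc[lambda d: ~(d == "$null$").any(axis=1)]

# .pipe passes the frame to a function that returns the filtered frame.
via_pipe = raw.pipe(drop_null_rows)

print(via_loc)
```

In both cases the frame is never named twice, so the filter can sit in the middle of a method chain on an intermediate result that has no variable name.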