Alternatives to awkward Pandas/Python DataFrame indexing: df_REPEATED[df_REPEATED['var'] > 0]?

感动是毒 2021-01-24 01:25

In Pandas/Python, I have to write the dataframe name twice when conditioning on its own variable:

df_REPEATED[df_REPEATED['var'] > 0]

This h

2 answers
  • 2021-01-24 02:12

    df_REPEATED['var'] > 0 is a boolean Series. Other than its length, it has no connection to the DataFrame: it could just as well be the result of another expression, say another_df['another_var'] > some_other_value, as long as the lengths match. That is what makes boolean indexing flexible, and a syntax like the one you suggest would rule it out. There are alternatives for what you are asking, though. For example,

    df_REPEATED.query('var > 0')
    

    query can be very fast on a large DataFrame and it is less verbose, but it lacks the flexibility of boolean indexing, and you start running into trouble once the expression gets complicated.
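
To make the comparison concrete, here is a minimal sketch on made-up data (the column values and the name another_df are only illustrative). It shows plain boolean indexing, a mask built from a different object, and the query() form side by side:

```python
import pandas as pd

# Toy data, invented for illustration only.
df = pd.DataFrame({"var": [-1, 0, 2, 5], "other": [10, 20, 30, 40]})
another_df = pd.DataFrame({"another_var": [7, 1, 9, 3]})

# Plain boolean indexing: the frame's name appears twice.
by_mask = df[df["var"] > 0]

# The mask can come from a different object, here one sharing the same index.
by_other = df[another_df["another_var"] > 5]

# query() evaluates the expression against the frame's own columns.
by_query = df.query("var > 0")

print(by_mask.equals(by_query))   # True
print(by_other.index.tolist())    # [0, 2]
```

The second selection is the case that a "name the frame only once" syntax cannot express, since the condition does not depend on df at all.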

  • 2021-01-24 02:21

    Not an official answer... but it already made my life simpler recently:

    https://github.com/toobaz/generic_utils/blob/master/generic_utils/pandas/where.py

    You don't need to download the entire repo: saving the file and doing

    from where import Where as W
    

    should suffice. Then you use it like this:

    import pandas as pd

    df = pd.DataFrame([[1, 2, True],
                       [3, 4, False], 
                       [5, 7, True]],
                      index=range(3), columns=['a', 'b', 'c'])
    # On specific column:
    print(df.loc[W['a'] > 2])
    print(df.loc[-W['a'] == W['b']])
    print(df.loc[~W['c']])
    # On entire DataFrame:
    print(df.loc[W.sum(axis=1) > 3])
    print(df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1])
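
For readers who do not want to fetch the helper, the following sketch spells out three of these selections with ordinary masks. It assumes that W simply stands in for the frame being indexed; I have not checked this against the linked file, so treat it as a reading of the example rather than a description of the helper:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, True],
                   [3, 4, False],
                   [5, 7, True]],
                  index=range(3), columns=['a', 'b', 'c'])

# Assumed equivalent of df.loc[W['a'] > 2]
a_gt_2 = df.loc[df['a'] > 2]

# Assumed equivalent of df.loc[~W['c']]
not_c = df.loc[~df['c']]

# Assumed equivalent of df.loc[W[['a', 'b']].diff(axis=1)['b'] > 1]
b_minus_a_gt_1 = df.loc[df[['a', 'b']].diff(axis=1)['b'] > 1]

print(a_gt_2.index.tolist())          # [1, 2]
print(not_c.index.tolist())           # [1]
print(b_minus_a_gt_1.index.tolist())  # [2]
```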
    

    A slightly less stupid usage example:

    data = pd.read_csv('ugly_db.csv').loc[~(W == '$null$').any(axis=1)]
    

    EDIT: another answer mentions an analogous approach that needs no external components, resulting in:

    data = (pd.read_csv('ugly_db.csv')
              .loc[lambda df : ~(df == '$null$').any(axis=1)])
    

    and another possibility is to use .pipe(), as in

    data = (pd.read_csv('ugly_db.csv')
              .pipe(lambda df: df[~(df == '$null$').any(axis=1)]))
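
Since ugly_db.csv is not available here, this is a small self-contained sketch of the same idea on an in-memory CSV. The CSV text and the helper name drop_null_rows are made up for illustration. Note that .loc applies the mask returned by the callable, whereas .pipe() just returns whatever the function returns, so the function passed to .pipe() has to do the indexing itself:

```python
import io

import pandas as pd

# Made-up stand-in for the contents of 'ugly_db.csv'.
csv_text = "x,y\n1,2\n$null$,4\n5,$null$\n7,8\n"


def drop_null_rows(frame, marker="$null$"):
    """Hypothetical helper: keep only rows where no cell equals `marker`."""
    return frame[~(frame == marker).any(axis=1)]


# .loc with a callable: the callable returns a boolean mask.
via_loc = (pd.read_csv(io.StringIO(csv_text))
             .loc[lambda df: ~(df == "$null$").any(axis=1)])

# .pipe(): the function returns the already filtered frame.
via_pipe = pd.read_csv(io.StringIO(csv_text)).pipe(drop_null_rows)

print(via_loc.equals(via_pipe))   # True
print(via_pipe.index.tolist())    # [0, 3]
```

Either way the frame is named only inside the lambda or helper, which is what keeps the method chain free of a repeated variable name.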
    