Selecting with complex criteria from pandas.DataFrame

后端 未结 4 939
孤街浪徒
孤街浪徒 2020-11-22 11:12

For example I have simple DF:

import pandas as pd
from random import randint

df = pd.DataFrame({\'A\': [randint(1, 9) for x in xrange(10)],
                         


        
相关标签:
4条回答
  • 2020-11-22 11:26

    And remember to use parenthesis!

    Keep in mind that & operator takes a precedence over operators such as > or < etc. That is why

    4 < 5 & 6 > 4
    

    evaluates to False. Therefore if you're using pd.loc, you need to put brackets around your logical statements, otherwise you get an error. That's why do:

    df.loc[(df['A'] > 10) & (df['B'] < 15)]
    

    instead of

    df.loc[df['A'] > 10 & df['B'] < 15]
    

    which would result in

    TypeError: cannot compare a dtyped [float64] array with a scalar of type [bool]

    0 讨论(0)
  • 2020-11-22 11:27

    You can use pandas it has some built in functions for comparison. So if you want to select values of "A" that are met by the conditions of "B" and "C" (assuming you want back a DataFrame pandas object)

    df[['A']][df.B.gt(50) & df.C.ne(900)]

    df[['A']] will give you back column A in DataFrame format.

    pandas 'gt' function will return the positions of column B that are greater than 50 and 'ne' will return the positions not equal to 900.

    0 讨论(0)
  • 2020-11-22 11:42

    Sure! Setup:

    >>> import pandas as pd
    >>> from random import randint
    >>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
                       'B': [randint(1, 9)*10 for x in range(10)],
                       'C': [randint(1, 9)*100 for x in range(10)]})
    >>> df
       A   B    C
    0  9  40  300
    1  9  70  700
    2  5  70  900
    3  8  80  900
    4  7  50  200
    5  9  30  900
    6  2  80  700
    7  2  80  400
    8  5  80  300
    9  7  70  800
    

    We can apply column operations and get boolean Series objects:

    >>> df["B"] > 50
    0    False
    1     True
    2     True
    3     True
    4    False
    5    False
    6     True
    7     True
    8     True
    9     True
    Name: B
    >>> (df["B"] > 50) & (df["C"] == 900)
    0    False
    1    False
    2     True
    3     True
    4    False
    5    False
    6    False
    7    False
    8    False
    9    False
    

    [Update, to switch to new-style .loc]:

    And then we can use these to index into the object. For read access, you can chain indices:

    >>> df["A"][(df["B"] > 50) & (df["C"] == 900)]
    2    5
    3    8
    Name: A, dtype: int64
    

    but you can get yourself into trouble because of the difference between a view and a copy doing this for write access. You can use .loc instead:

    >>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"]
    2    5
    3    8
    Name: A, dtype: int64
    >>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"].values
    array([5, 8], dtype=int64)
    >>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"] *= 1000
    >>> df
          A   B    C
    0     9  40  300
    1     9  70  700
    2  5000  70  900
    3  8000  80  900
    4     7  50  200
    5     9  30  900
    6     2  80  700
    7     2  80  400
    8     5  80  300
    9     7  70  800
    

    Note that I accidentally typed == 900 and not != 900, or ~(df["C"] == 900), but I'm too lazy to fix it. Exercise for the reader. :^)

    0 讨论(0)
  • 2020-11-22 11:42

    Another solution is to use the query method:

    import pandas as pd
    
    from random import randint
    df = pd.DataFrame({'A': [randint(1, 9) for x in xrange(10)],
                       'B': [randint(1, 9) * 10 for x in xrange(10)],
                       'C': [randint(1, 9) * 100 for x in xrange(10)]})
    print df
    
       A   B    C
    0  7  20  300
    1  7  80  700
    2  4  90  100
    3  4  30  900
    4  7  80  200
    5  7  60  800
    6  3  80  900
    7  9  40  100
    8  6  40  100
    9  3  10  600
    
    print df.query('B > 50 and C != 900')
    
       A   B    C
    1  7  80  700
    2  4  90  100
    4  7  80  200
    5  7  60  800
    

    Now if you want to change the returned values in column A you can save their index:

    my_query_index = df.query('B > 50 & C != 900').index
    

    ....and use .iloc to change them i.e:

    df.iloc[my_query_index, 0] = 5000
    
    print df
    
          A   B    C
    0     7  20  300
    1  5000  80  700
    2  5000  90  100
    3     4  30  900
    4  5000  80  200
    5  5000  60  800
    6     3  80  900
    7     9  40  100
    8     6  40  100
    9     3  10  600
    
    0 讨论(0)
提交回复
热议问题