Getting a list of indices where pandas boolean series is True

栀梦 2020-12-05 17:36

I have a pandas series with boolean entries. I would like to get a list of indices where the values are True.

For example the input pd.Series([Tr

2 Answers
  • 2020-12-05 18:05

    Using Boolean Indexing

    >>> s = pd.Series([True, False, True, True, False, False, False, True])
    >>> s[s].index
    Int64Index([0, 2, 3, 7], dtype='int64')
    

    If you need a np.array object, get the .values

    >>> s[s].index.values
    array([0, 2, 3, 7])
    

    Using np.nonzero

    >>> np.nonzero(s)
    (array([0, 2, 3, 7]),)
    

    Using np.flatnonzero

    >>> np.flatnonzero(s)
    array([0, 2, 3, 7])
    

    Using np.where

    >>> np.where(s)[0]
    array([0, 2, 3, 7])
    

    Using np.argwhere

    >>> np.argwhere(s).ravel()
    array([0, 2, 3, 7])
    

    Using pd.Series.index

    >>> s.index[s]
    Int64Index([0, 2, 3, 7], dtype='int64')
    

    Using python's built-in filter

    >>> [*filter(s.get, s.index)]
    [0, 2, 3, 7]
    

    Using list comprehension

    >>> [i for i in s.index if s[i]]
    [0, 2, 3, 7]
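
One thing the example above hides is that its index is the default 0..n-1, so labels and positions happen to coincide. As a minimal sketch, assuming a series with string labels (the labels "a" to "d" are made up for illustration), the label-based and position-based approaches give different results:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a non-default (string) index.
s = pd.Series([True, False, True, True], index=["a", "b", "c", "d"])

# Boolean indexing returns index *labels*.
labels = s[s].index.tolist()

# The NumPy helpers return integer *positions*.
positions = np.flatnonzero(s.to_numpy()).tolist()

# Positions can be mapped back to labels through the index.
labels_from_positions = s.index[positions].tolist()

print(labels)                 # labels of the True entries
print(positions)              # positions of the True entries
print(labels_from_positions)  # same labels, recovered from positions
```

So if the index is not a plain range, pick s[s].index or s.index[s] when you want labels, and the NumPy functions when you want positions.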
    
  • 2020-12-05 18:09

    As an addition to rafaelc's answer, here are the corresponding timings (from quickest to slowest) for the following setup:

    import numpy as np
    import pandas as pd
    s = pd.Series([x > 0.5 for x in np.random.random(size=1000)])
    

    Using np.where

    >>> timeit np.where(s)[0]
    12.7 µs ± 77.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    Using np.flatnonzero

    >>> timeit np.flatnonzero(s)
    18 µs ± 508 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    Using pd.Series.index

    The time difference compared to boolean indexing really surprised me, since boolean indexing is the more commonly used approach.

    >>> timeit s.index[s]
    82.2 µs ± 38.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    Using Boolean Indexing

    >>> timeit s[s].index
    1.75 ms ± 2.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    If you need a np.array object, get the .values

    >>> timeit s[s].index.values
    1.76 ms ± 3.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    If you want a slightly easier-to-read version <-- not in original answer

    >>> timeit s[s==True].index
    1.89 ms ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    Using pd.Series.where <-- not in original answer

    >>> timeit s.where(s).dropna().index
    2.22 ms ± 3.32 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    >>> timeit s.where(s == True).dropna().index
    2.37 ms ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

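    Using pd.Series.mask <-- not in original answer

    Note that mask replaces the entries where the condition is True, so s.mask(s) as timed here keeps the False positions; s.mask(~s) is the variant that returns the True indices.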

    >>> timeit s.mask(s).dropna().index
    2.29 ms ± 1.43 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    >>> timeit s.mask(s == True).dropna().index
    2.44 ms ± 5.82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
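
To check what the where and mask variants actually select, here is a small sketch on the example series from the first answer. The variable names are only for illustration; the calls are the ones timed above, plus the inverted mask:

```python
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# where keeps values where the condition holds and puts NaN elsewhere.
true_via_where = s.where(s).dropna().index.tolist()

# mask does the opposite: it puts NaN where the condition holds,
# so the condition has to be inverted to keep the True entries.
true_via_mask = s.mask(~s).dropna().index.tolist()

# Without the inversion, the False positions survive instead.
false_via_mask = s.mask(s).dropna().index.tolist()

print(true_via_where)
print(true_via_mask)
print(false_via_mask)
```

The extra negation should not change the timing picture much, but that is an assumption and was not measured here.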
    

    Using list comprehension

    >>> timeit [i for i in s.index if s[i]]
    13.7 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Using python's built-in filter

    >>> timeit [*filter(s.get, s.index)]
    14.2 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
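
The timeit lines above are IPython magics. If you want to rerun the comparison as a plain script, something like the following sketch should work with the standard-library timeit module. The helper best_time, the seed, and the repeat/number values are my own choices, not part of the original measurements, and the absolute numbers will differ by machine and library version:

```python
import timeit

import numpy as np
import pandas as pd

# Same kind of setup as above: 1000 random booleans (seed chosen arbitrarily).
rng = np.random.default_rng(0)
s = pd.Series(rng.random(1000) > 0.5)

# A subset of the approaches timed above.
candidates = {
    "np.where": lambda: np.where(s)[0],
    "np.flatnonzero": lambda: np.flatnonzero(s),
    "s.index[s]": lambda: s.index[s],
    "s[s].index": lambda: s[s].index,
}


def best_time(func, repeat=3, number=200):
    """Return the best observed time per call, in seconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


timings = {name: best_time(func) for name, func in candidates.items()}

# Print from quickest to slowest.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:15s} {seconds * 1e6:10.1f} µs per call")
```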
    
    

    Using np.nonzero <-- did not work out of the box for me

    >>> timeit np.nonzero(s)
    ValueError: Length of passed values is 1, index implies 1000.
    

    Using np.argwhere <-- did not work out of the box for me

    >>> timeit np.argwhere(s).ravel()
    ValueError: Length of passed values is 1, index implies 1000.
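
Whether the two calls above raise seems to depend on the installed NumPy and pandas versions. A workaround that sidesteps the issue, as a minimal sketch, is to hand NumPy a plain array via Series.to_numpy() (s.values works too) instead of the Series itself:

```python
import numpy as np
import pandas as pd

s = pd.Series([True, False, True, True, False, False, False, True])

# Convert to a plain ndarray first so NumPy never tries to wrap
# its result back into a Series.
mask = s.to_numpy()

via_nonzero = np.nonzero(mask)[0]
via_argwhere = np.argwhere(mask).ravel()

print(via_nonzero)
print(via_argwhere)
```

Both return integer positions, so they match the index labels only when the series has the default range index.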
    
