Return dataframe subset based on a list of boolean values

星月不相逢 2021-02-13 04:38

I'm trying to slice a dataframe based on a list of values. How would I go about this?

Say I have an expression or a list l = [0,1,0,0,1,1,0,0,0,1]

Ho

6 Answers
  • 2021-02-13 05:14

    Convert the list to a boolean array and then use boolean indexing:

    import numpy as np
    import pandas as pd

    lst = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
    df = pd.DataFrame(np.random.randint(10, size=(10, 3)))

    df[np.array(lst).astype(bool)]
    Out: 
       0  1  2
    1  8  6  3
    4  2  7  3
    5  7  2  3
    9  1  3  4
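
A minimal sketch of why the conversion to `bool` matters, assuming a small deterministic frame (the `np.arange` data is made up for illustration and is not from the thread): a list of integers passed to `df[...]` is read as column labels, while a boolean array is read as a row mask.

```python
import numpy as np
import pandas as pd

# Hypothetical deterministic frame so the result is reproducible.
df = pd.DataFrame(np.arange(30).reshape(10, 3))
lst = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]

# A list of ints is interpreted as column labels, not as a row mask.
by_label = df[[0, 1]]          # columns 0 and 1, all ten rows

# Converting to bool first switches pandas to row-wise boolean indexing.
mask = np.array(lst).astype(bool)
subset = df[mask]              # keeps the rows where the mask is True

print(subset)
```

The mask has to be exactly as long as the frame; a shorter or longer boolean array raises an error rather than being padded.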
    
  • 2021-02-13 05:16

    Yet another "creative" approach:

    In [181]: a = np.array(lst)
    
    In [182]: df.query("index * @a > 0")
    Out[182]:
       0  1  2
    1  1  5  5
    4  0  2  0
    5  4  9  9
    9  2  2  5
    

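    Or a much better variant from @ayhan, which also works when row 0 is selected (the expression above can never pick index 0, because 0 * a is always 0):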

    In [183]: df.query("@a != 0")
    Out[183]:
       0  1  2
    1  1  5  5
    4  0  2  0
    5  4  9  9
    9  2  2  5
    

    PS: I've also borrowed @ayhan's setup.

  • 2021-02-13 05:22

    Or find the positions of the 1s in your list and select those rows from the DataFrame:

    df.loc[[i for i,x in enumerate(lst) if x == 1],:]
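
One caveat, sketched under the assumption that the frame does not have the default 0..N-1 index: `.loc` looks up labels, so positions computed with `enumerate` only match when labels and positions coincide. The string index and the `value` column below are hypothetical; `np.flatnonzero` with `.iloc` is one position-based alternative, not the method used in the answer above.

```python
import numpy as np
import pandas as pd

lst = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
# Hypothetical frame with string labels instead of the default 0..9 index.
df = pd.DataFrame({"value": range(10)}, index=list("abcdefghij"))

# Positions of the non-zero entries in the list.
positions = np.flatnonzero(lst)

# .iloc selects by position, so the index labels do not matter.
subset = df.iloc[positions]

print(subset)
```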
    
  • 2021-02-13 05:26

    Selecting using a list of Booleans is something itertools.compress does well.

    Given

    >>> df = pd.DataFrame(np.random.randint(10, size=(10, 2)))
    >>> selectors = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
    

    Code

    >>> import itertools
    >>> selected_idxs = list(itertools.compress(range(len(df)), selectors))   # positions [1, 4, 5, 9]
    >>> df.iloc[selected_idxs, :]
       0  1
    1  1  9
    4  3  4
    5  4  1
    9  8  9
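
For readers unfamiliar with `itertools.compress`, here is a small sketch of what it does on its own, independent of pandas. The `labels` list is a made-up example: each element of the first iterable is kept only where the matching selector is truthy.

```python
import itertools

selectors = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]
# Hypothetical data to filter; any iterable of the same length works.
labels = list("abcdefghij")

# Keep each label whose selector is truthy (1 here).
picked = list(itertools.compress(labels, selectors))

print(picked)
```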
    
  • 2021-02-13 05:28

    Setup
    Borrowed @ayhan's setup

    df = pd.DataFrame(np.random.randint(10, size=(10, 3)))
    

    Without numpy
    Not the fastest, but it holds its own and is definitely the shortest.

    df[list(map(bool, lst))]
    
       0  1  2
    1  3  5  6
    4  6  3  2
    5  5  7  6
    9  0  0  1
    

    Timing

    results.div(results.min(1), 0).round(2).pipe(lambda d: d.assign(Best=d.idxmin(1)))
    
             ayh   wvo   pir   mxu   wen Best
    N                                        
    1       1.53  1.00  1.02  4.95  2.61  wvo
    3       1.06  1.00  1.04  5.46  2.84  wvo
    10      1.00  1.00  1.00  4.30  2.73  ayh
    30      1.00  1.05  1.24  4.06  3.76  ayh
    100     1.16  1.00  1.19  3.90  3.53  wvo
    300     1.29  1.00  1.32  2.50  2.38  wvo
    1000    1.54  1.00  2.19  2.24  3.85  wvo
    3000    1.39  1.00  2.17  1.81  4.55  wvo
    10000   1.22  1.00  2.21  1.35  4.36  wvo
    30000   1.19  1.00  2.26  1.39  5.36  wvo
    100000  1.19  1.00  2.19  1.31  4.82  wvo
    

    fig, (a1, a2) = plt.subplots(2, 1, figsize=(6, 6))
    results.plot(loglog=True, lw=3, ax=a1)
    results.div(results.min(1), 0).round(2).plot.bar(logy=True, ax=a2)
    fig.tight_layout()
    


    Testing Code

    from timeit import timeit

    ayh = lambda d, l: d[np.array(l).astype(bool)]
    wvo = lambda d, l: d[np.array(l, dtype=bool)]
    pir = lambda d, l: d[list(map(bool, l))]
    wen = lambda d, l: d.loc[[i for i, x in enumerate(l) if x == 1], :]
    
    def mxu(d, l):
        a = np.array(l)
        return d.query('@a != 0')
    
    results = pd.DataFrame(
        index=pd.Index([1, 3, 10, 30, 100, 300,
                        1000, 3000, 10000, 30000, 100000], name='N'),
        columns='ayh wvo pir mxu wen'.split(),
        dtype=float
    )
    
    for i in results.index:
        d = pd.concat([df] * i, ignore_index=True)
        l = lst * i
        for j in results.columns:
            stmt = '{}(d, l)'.format(j)
            setp = 'from __main__ import d, l, {}'.format(j)
            # DataFrame.set_value was removed in pandas 1.0; .at is the replacement
            results.at[i, j] = timeit(stmt, setp, number=10)
    
  • 2021-02-13 05:36

    You can use masking here:

    df[np.array([0,1,0,0,1,1,0,0,0,1],dtype=bool)]
    

    This builds a boolean array of True and False values; every position where the array is True marks a row that is selected.

    Note that the filter is not applied in place. To keep the result, assign it to a variable (the same one or a different one):

    df2 = df[np.array([0,1,0,0,1,1,0,0,0,1],dtype=bool)]
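
A short sketch of the "not in place" point, assuming a toy single-column frame (the column name `x` is made up): the original frame keeps all its rows, and the filtered copy keeps the original index labels unless they are reset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the default 0..9 index.
df = pd.DataFrame({"x": range(10)})
mask = np.array([0, 1, 0, 0, 1, 1, 0, 0, 0, 1], dtype=bool)

# Masking returns a new frame; df itself is left untouched.
df2 = df[mask]

# The selected rows keep their original labels; reset_index(drop=True)
# renumbers them from 0 if that is preferred.
df2_renumbered = df2.reset_index(drop=True)

print(len(df), len(df2))
```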
    