Drop rows with all zeros in pandas data frame

礼貌的吻别 2020-11-27 11:57

I can use pandas dropna() functionality to remove rows with some or all columns set as NA's. Is there an equivalent function for drop

13 answers
  • 2020-11-27 12:37

    It turns out this can be nicely expressed in a vectorized fashion:

    > df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1]})
    > df = df[(df.T != 0).any()]
    > df
       a  b
    1  0  1
    2  1  0
    3  1  1
    
  • 2020-11-27 12:38

    Another alternative:

    # df != 0                 --> element-wise: which entries are non-zero?
    # (df != 0).any(axis=1)   --> row-wise: does the row contain any non-zero entry?
    # df.loc[nonzero_mask, :] --> keep only the rows that contain a non-zero entry.
    # .shape                  --> confirm that the result is a subset.
    
    nonzero_mask = (df != 0).any(axis=1)  # True where the row is not all zeros
    df.loc[nonzero_mask, :].shape
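The snippet in this answer assumes a `df` already exists. A minimal self-contained run, reusing the four-row frame from the first answer (the name `kept` is only illustrative), might look like this:

```python
import pandas as pd

# Same four-row frame as in the first answer.
df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})

# True for every row that has at least one non-zero entry.
row_has_nonzero = (df != 0).any(axis=1)

# Keep only those rows; row 0 is the only all-zero row.
kept = df.loc[row_has_nonzero, :]
print(kept.shape)  # (3, 2)
```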
    
  • 2020-11-27 12:39

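    A couple of solutions I found helpful while looking this up, especially for larger data sets. Both assume that the non-zero values in a row cannot cancel out to a sum of zero (for example, because all values are non-negative):

    df[(df.sum(axis=1) != 0)]       # about 30% faster in the timings below
    df[df.values.sum(axis=1) != 0]  # about 3x faster in the timings below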
    

    Continuing with the example from @U2EF1:

    In [88]: df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1]})
    
    In [91]: %timeit df[(df.T != 0).any()]
    1000 loops, best of 3: 686 µs per loop
    
    In [92]: df[(df.sum(axis=1) != 0)]
    Out[92]: 
       a  b
    1  0  1
    2  1  0
    3  1  1
    
    In [95]: %timeit df[(df.sum(axis=1) != 0)]
    1000 loops, best of 3: 495 µs per loop
    
    In [96]: %timeit df[df.values.sum(axis=1) != 0]
    1000 loops, best of 3: 217 µs per loop
    

    On a larger dataset:

    In [119]: bdf = pd.DataFrame(np.random.randint(0,2,size=(10000,4)))
    
    In [120]: %timeit bdf[(bdf.T != 0).any()]
    1000 loops, best of 3: 1.63 ms per loop
    
    In [121]: %timeit bdf[(bdf.sum(axis=1) != 0)]
    1000 loops, best of 3: 1.09 ms per loop
    
    In [122]: %timeit bdf[bdf.values.sum(axis=1) != 0]
    1000 loops, best of 3: 517 µs per loop
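One thing to keep in mind with the sum-based filters: they test whether the row sum is zero, not whether every entry is zero, so a row whose positive and negative values cancel out is dropped as well. A small sketch with made-up data (the frame below is hypothetical, chosen only to trigger the cancellation):

```python
import pandas as pd

# Hypothetical data: row 1 holds non-zero values that sum to zero.
df = pd.DataFrame({'a': [0, 1, 2], 'b': [0, -1, 3]})

by_sum = df[df.sum(axis=1) != 0]    # drops rows 0 and 1
by_any = df[(df != 0).any(axis=1)]  # drops only row 0

print(list(by_sum.index))  # [2]
print(list(by_any.index))  # [1, 2]
```

For data that is known to be non-negative, such as the 0/1 frames timed above, both filters select the same rows.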
    
  • 2020-11-27 12:39

    You can use a quick lambda function to check if all the values in a given row are 0. Then you can use the result of applying that lambda as a way to choose only the rows that match or don't match that condition:

    import pandas as pd
    import numpy as np
    
    np.random.seed(0)
    
    df = pd.DataFrame(np.random.randn(5,3), 
                      index=['one', 'two', 'three', 'four', 'five'],
                      columns=list('abc'))
    
    df.loc[['one', 'three']] = 0
    
    print(df)
    print(df.loc[~df.apply(lambda row: (row==0).all(), axis=1)])
    

    Yields:

                  a         b         c
    one    0.000000  0.000000  0.000000
    two    2.240893  1.867558 -0.977278
    three  0.000000  0.000000  0.000000
    four   0.410599  0.144044  1.454274
    five   0.761038  0.121675  0.443863
    
    [5 rows x 3 columns]
                 a         b         c
    two   2.240893  1.867558 -0.977278
    four  0.410599  0.144044  1.454274
    five  0.761038  0.121675  0.443863
    
    [3 rows x 3 columns]
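The same row test can also be written without `apply`, by comparing the whole frame to zero and reducing along the columns. A sketch that rebuilds the frame from this answer and checks that the two forms agree (the names `via_apply` and `via_mask` are only illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(0)

df = pd.DataFrame(np.random.randn(5, 3),
                  index=['one', 'two', 'three', 'four', 'five'],
                  columns=list('abc'))
df.loc[['one', 'three']] = 0

# Row-by-row check with a lambda, as in the answer.
via_apply = df.loc[~df.apply(lambda row: (row == 0).all(), axis=1)]

# Vectorized equivalent: a row is dropped when all of its entries equal 0.
via_mask = df.loc[~(df == 0).all(axis=1)]

print(via_apply.equals(via_mask))
```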
    
  • 2020-11-27 12:42

    To drop every row that contains a 0 in any column (note that this also drops rows that already contain NaN):

    new_df = df[df != 0].dropna()
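This is a stricter filter than the one the question asks about: masking with `df != 0` turns each zero into NaN, and `dropna()` then removes any row holding at least one NaN. A short comparison on the four-row frame from the first answer (variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1]})

# Zeros become NaN, then every row with a NaN is dropped.
no_zero_anywhere = df[df != 0].dropna()

# Keeps every row with at least one non-zero entry.
not_all_zero = df[(df != 0).any(axis=1)]

print(list(no_zero_anywhere.index))  # [3]
print(list(not_all_zero.index))      # [1, 2, 3]
```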
    
  • 2020-11-27 12:44

    I look up this question about once a month and always have to dig out the best answer from the comments:

    df.loc[(df!=0).any(axis=1)]
    

    Thanks Dan Allan!
