Filter out rows with more than certain number of NaN

谎友^ 2020-12-09 06:09

In a Pandas dataframe, I would like to filter out all the rows that have more than 2 NaNs.

Essentially, I have 4 columns and I would like to keep only those rows where at least 2 columns have finite values.

3 answers
  • 2020-12-09 06:49

    I had a slightly different problem: filtering out columns that have more than a certain number of NaNs.

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'a':[1,2,np.nan,4,5], 'b':[np.nan,2,np.nan,4,5], 'c':[1,2,np.nan,np.nan,np.nan], 'd':[1,2,3,np.nan,5]})
    df
    
        a   b   c   d
    0   1.0 NaN 1.0 1.0
    1   2.0 2.0 2.0 2.0
    2   NaN NaN NaN 3.0
    3   4.0 4.0 NaN NaN
    4   5.0 5.0 NaN 5.0
    

    Assume you want to filter out columns with 3 or more NaNs:

    num_rows = df.shape[0]
    drop_cols_with_this_amount_of_nans_or_more = 3
    keep_cols_with_at_least_this_number_of_non_nans = num_rows - drop_cols_with_this_amount_of_nans_or_more + 1
    
    df.dropna(axis=1,thresh=keep_cols_with_at_least_this_number_of_non_nans)
    

    Output (column c has been dropped, as expected):

        a   b   d
    0   1.0 NaN 1.0
    1   2.0 2.0 2.0
    2   NaN NaN 3.0
    3   4.0 4.0 NaN
    4   5.0 5.0 5.0
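
If computing the thresh value feels indirect, the same column filter can be sketched by counting NaNs explicitly. This is a minimal alternative, not the approach used above; the name max_nans_allowed is hypothetical, and the frame is the same example data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4, 5],
    'b': [np.nan, 2, np.nan, 4, 5],
    'c': [1, 2, np.nan, np.nan, np.nan],
    'd': [1, 2, 3, np.nan, 5],
})

# Hypothetical name: columns with more NaNs than this are dropped.
max_nans_allowed = 2

# Count NaNs per column, then keep only the columns within the limit.
nan_counts = df.isna().sum()
filtered = df.loc[:, nan_counts <= max_nans_allowed]

# Equivalent thresh-based call: require at least (rows - allowed NaNs) non-null values.
via_thresh = df.dropna(axis=1, thresh=len(df) - max_nans_allowed)
```

Both filtered and via_thresh keep columns a, b and d. The mask version states the rule in terms of NaN counts, which is how the question is phrased.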
    
  • 2020-12-09 06:53

    You have phrased 2 slightly different questions here. In the general case, they have different answers.

    I would like to keep only those rows where at least 2 columns have finite values.

    df = df.dropna(thresh=2)
    

    This keeps rows with 2 or more non-null values.


    I would like to filter out all the rows that have more than 2 NaNs

    df = df.dropna(thresh=df.shape[1]-2)
    

    This filters out rows with more than 2 null values (that is, 3 or more), because each row must keep at least df.shape[1] - 2 non-null values.

    In your example dataframe of 4 columns, these operations are equivalent, since df.shape[1] - 2 == 2. However, you will notice discrepancies with dataframes which do not have exactly 4 columns.
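
To make the discrepancy concrete, here is a small sketch on a hypothetical 5-column frame (the data is invented for illustration). Row 1 has exactly 2 non-null values and 3 NaNs, so the two calls disagree about it.

```python
import numpy as np
import pandas as pd

# Hypothetical 5-column frame, not the asker's data.
df = pd.DataFrame({
    'a': [1.0, 1.0, np.nan],
    'b': [2.0, 2.0, np.nan],
    'c': [3.0, np.nan, np.nan],
    'd': [4.0, np.nan, np.nan],
    'e': [5.0, np.nan, 5.0],
})

# Keep rows with at least 2 non-null values.
at_least_two_values = df.dropna(thresh=2)

# Keep rows with at most 2 NaNs, i.e. at least 5 - 2 = 3 non-null values.
at_most_two_nans = df.dropna(thresh=df.shape[1] - 2)
```

Here at_least_two_values retains rows 0 and 1, while at_most_two_nans retains only row 0.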


    Note dropna also has a subset argument should you wish to include only specified columns when applying a threshold. For example:

    df = df.dropna(subset=['col1', 'col2', 'col3'], thresh=2)
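
As a rough illustration of how subset changes the count, the sketch below uses an invented frame with a hypothetical extra column named other. Only non-null values in the listed columns count towards thresh.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [1.0, np.nan, np.nan],
    'col2': [2.0, np.nan, 2.0],
    'col3': [3.0, 3.0, 3.0],
    'other': [np.nan, 9.0, 9.0],  # hypothetical column outside the subset
})

# Row 1 has 2 non-null values overall, but only 1 within the subset.
with_subset = df.dropna(subset=['col1', 'col2', 'col3'], thresh=2)
without_subset = df.dropna(thresh=2)
```

With the subset, row 1 is dropped. Without it, all three rows are kept.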
    
  • 2020-12-09 07:07

    The following should work

    df.dropna(thresh=2)
    

    See the online docs

    What we are doing here is keeping every row that has 2 or more non-NaN values and dropping the rest.

    Example:

    In [25]:
    
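    import numpy as np
    import pandas as pd
    
    # np.nan marks the missing values (a bare NaN name is not defined by default)
    df = pd.DataFrame({'a':[1,2,np.nan,4,5], 'b':[np.nan,2,np.nan,4,5], 'c':[1,2,np.nan,np.nan,np.nan], 'd':[1,2,3,np.nan,5]})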
    
    df
    
    Out[25]:
    
        a   b   c   d
    0   1 NaN   1   1
    1   2   2   2   2
    2 NaN NaN NaN   3
    3   4   4 NaN NaN
    4   5   5 NaN   5
    
    [5 rows x 4 columns]
    
    In [26]:
    
    df.dropna(thresh=2)
    
    Out[26]:
    
       a   b   c   d
    0  1 NaN   1   1
    1  2   2   2   2
    3  4   4 NaN NaN
    4  5   5 NaN   5
    
    [4 rows x 4 columns]
    

    EDIT

    This works for the example above, but note that you need to know the number of columns and set thresh accordingly. I originally thought thresh meant the number of NaN values, but it actually means the minimum number of non-NaN values a row must have to be kept.
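
One way to avoid translating between NaN counts and thresh is to wrap the rule in a small helper that takes the maximum number of NaNs directly. This is only a sketch; the function name drop_rows_with_more_nans_than is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd


def drop_rows_with_more_nans_than(frame, max_nans):
    """Return the rows of `frame` that contain at most `max_nans` missing values."""
    nans_per_row = frame.isna().sum(axis=1)
    return frame[nans_per_row <= max_nans]


df = pd.DataFrame({
    'a': [1, 2, np.nan, 4, 5],
    'b': [np.nan, 2, np.nan, 4, 5],
    'c': [1, 2, np.nan, np.nan, np.nan],
    'd': [1, 2, 3, np.nan, 5],
})

# Drops only row 2, which has 3 NaNs.
result = drop_rows_with_more_nans_than(df, 2)
```

On this 4-column frame the result matches df.dropna(thresh=2), and the helper keeps the same meaning if columns are added later.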
