How to select rows with one or more nulls from a pandas DataFrame without listing columns explicitly?

滥情空心 2020-11-28 00:37

I have a dataframe with ~300K rows and ~40 columns. I want to find out if any rows contain null values - and put these 'null'-rows into a separate dataframe so that I coul

5 answers
  • 2020-11-28 01:06

    .any() and .all() are great for the extreme cases, but not when you're looking for a specific number of null values. Here's an extremely simple way to do what I believe you're asking. It's pretty verbose, but functional.

    import pandas as pd
    import numpy as np
    
    # Some test data frame
    df = pd.DataFrame({'num_legs':          [2, 4,      np.nan, 0, np.nan],
                       'num_wings':         [2, 0,      np.nan, 0, 9],
                       'num_specimen_seen': [10, np.nan, 1,     8, np.nan]})
    
    # Helper: counts the NaNs in each row
    def row_nan_sums(df):
        sums = []
        for row in df.values:
            count = 0  # named count so the built-in sum() is not shadowed
            for el in row:
                if el != el:  # NaN is never equal to itself. This is "hacky", but complete.
                    count += 1
            sums.append(count)
        return sums
    
    # Returns a list of positional indices for rows with k+ NaNs
    def query_k_plus_sums(df, k):
        sums = row_nan_sums(df)
        return [i for i, count in enumerate(sums) if count >= k]
    
    # test
    print(df)
    print(query_k_plus_sums(df, 2))
    

    Output

       num_legs  num_wings  num_specimen_seen
    0       2.0        2.0               10.0
    1       4.0        0.0                NaN
    2       NaN        NaN                1.0
    3       0.0        0.0                8.0
    4       NaN        9.0                NaN
    [2, 4]
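
    The same row-wise count can probably be expressed without Python loops. The following is a minimal sketch, not the code from this answer: it rebuilds the test frame above and uses `isna().sum(axis=1)`; the helper name `rows_with_k_plus_nans` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Same test frame as in the answer above
df = pd.DataFrame({'num_legs':          [2, 4,      np.nan, 0, np.nan],
                   'num_wings':         [2, 0,      np.nan, 0, 9],
                   'num_specimen_seen': [10, np.nan, 1,     8, np.nan]})

def rows_with_k_plus_nans(frame, k):
    """Return positional indices of rows with at least k NaNs (hypothetical helper)."""
    nan_counts = frame.isna().sum(axis=1)  # number of NaNs in each row
    return np.flatnonzero(nan_counts.to_numpy() >= k).tolist()

positions = rows_with_k_plus_nans(df, 2)
print(positions)
```

    On a frame with hundreds of thousands of rows, the vectorized count should generally be much faster than iterating over `df.values`.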
    

    Then, if you're like me and want to clear those rows out, you just write this:

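    # drop the rows from the data frame (drop() takes index labels, which
    # match the positions returned above only for a default 0..n-1 index)
    df.drop(query_k_plus_sums(df, 2), inplace=True)
    # Shuffle the rows, then renumber the index from 0
    # (reset_index(drop=True) alone resets the indices; the shuffle is optional)
    df = df.sample(frac=1).reset_index(drop=True)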
    # print data frame
    print(df)
    

    Output:

       num_legs  num_wings  num_specimen_seen
    0       4.0        0.0                NaN
    1       0.0        0.0                8.0
    2       2.0        2.0               10.0
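
    Dropping rows with k or more NaNs can also be sketched without a helper and without shuffling. This is an illustrative alternative under the same test data, not the answer's own method: one variant filters with a boolean mask, the other uses `dropna(thresh=...)`, where `thresh` is the minimum number of non-null values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num_legs':          [2, 4,      np.nan, 0, np.nan],
                   'num_wings':         [2, 0,      np.nan, 0, 9],
                   'num_specimen_seen': [10, np.nan, 1,     8, np.nan]})

k = 2  # drop rows that contain k or more NaNs

# Variant 1: keep rows with fewer than k NaNs, then renumber the index from 0.
kept_mask = df[df.isna().sum(axis=1) < k].reset_index(drop=True)

# Variant 2: "fewer than k NaNs" is the same as
# "at least (number of columns - k + 1) non-null values".
kept_thresh = df.dropna(thresh=df.shape[1] - k + 1).reset_index(drop=True)

print(kept_mask)
```

    Both variants keep the surviving rows in their original order, so the result is deterministic.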
    
  • 2020-11-28 01:14

    [Updated to adapt to modern pandas, which has isnull as a method of DataFrames.]

    You can use isnull and any to build a boolean Series and use that to index into your frame:

    >>> df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)])
    >>> df.isnull()
           0      1      2
    0  False  False  False
    1  False   True  False
    2  False  False   True
    3  False  False  False
    4  False  False  False
    >>> df.isnull().any(axis=1)
    0    False
    1     True
    2     True
    3    False
    4    False
    dtype: bool
    >>> df[df.isnull().any(axis=1)]
       0   1   2
    1  0 NaN   0
    2  0   0 NaN
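
    Since the question asks for the null rows to end up in a separate dataframe, the boolean Series can be reused to split the frame in two. A minimal sketch, assuming the same example data; the variable names `has_null`, `null_rows` and `complete_rows` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([list(range(3)), [0, np.nan, 0], [0, 0, np.nan],
                   list(range(3)), list(range(3))])

has_null = df.isnull().any(axis=1)  # True for rows with at least one null

null_rows = df[has_null]        # rows containing one or more nulls
complete_rows = df[~has_null]   # rows with no nulls at all

print(null_rows.index.tolist(), complete_rows.index.tolist())
```

    No column names are listed anywhere, so this should work unchanged on a frame with ~40 columns.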
    

    [For older pandas:]

    You could use the function isnull instead of the method:

    In [56]: df = pd.DataFrame([range(3), [0, np.nan, 0], [0, 0, np.nan], range(3), range(3)])
    
    In [57]: df
    Out[57]: 
       0   1   2
    0  0   1   2
    1  0 NaN   0
    2  0   0 NaN
    3  0   1   2
    4  0   1   2
    
    In [58]: pd.isnull(df)
    Out[58]: 
           0      1      2
    0  False  False  False
    1  False   True  False
    2  False  False   True
    3  False  False  False
    4  False  False  False
    
    In [59]: pd.isnull(df).any(axis=1)
    Out[59]: 
    0    False
    1     True
    2     True
    3    False
    4    False
    

    leading to the rather compact:

    In [60]: df[pd.isnull(df).any(axis=1)]
    Out[60]: 
       0   1   2
    1  0 NaN   0
    2  0   0 NaN
    
  • 2020-11-28 01:20

    If you want to filter rows by a certain number of columns with null values, you may use this:

    df[df.isnull().sum(axis=1) >= qty_of_nuls]
    

    So, here is the example:

    Your dataframe:

    >>> df = pd.DataFrame([range(4), [0, np.nan, 0, np.nan], [0, 0, np.nan, 0], range(4), [np.nan, 0, np.nan, np.nan]])
    >>> df
         0    1    2    3
    0  0.0  1.0  2.0  3.0
    1  0.0  NaN  0.0  NaN
    2  0.0  0.0  NaN  0.0
    3  0.0  1.0  2.0  3.0
    4  NaN  0.0  NaN  NaN
    

    If you want to select the rows that have two or more columns with null values, you run the following:

    >>> qty_of_nuls = 2
    >>> df[df.isnull().sum(axis=1) >= qty_of_nuls]
         0    1    2   3
    1  0.0  NaN  0.0 NaN
    4  NaN  0.0  NaN NaN
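
    Selecting by the row-wise null count does not depend on the index being the default 0..n-1 range when the boolean mask is used directly for indexing. A minimal sketch, assuming the same values as above but with string row labels added purely for illustration:

```python
import numpy as np
import pandas as pd

# Same values as the example, with string labels (an assumption for
# illustration) instead of the default integer index.
df = pd.DataFrame(
    [[0, 1, 2, 3],
     [0, np.nan, 0, np.nan],
     [0, 0, np.nan, 0],
     [0, 1, 2, 3],
     [np.nan, 0, np.nan, np.nan]],
    index=list("abcde"),
)

qty_of_nuls = 2
null_counts = df.isnull().sum(axis=1)       # nulls per row
selected = df[null_counts >= qty_of_nuls]   # boolean mask, aligned on row labels

print(selected.index.tolist())
```

    Passing index labels to `iloc`, by contrast, only works when the labels happen to coincide with row positions.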
    
  • 2020-11-28 01:27
    def nans(df): return df[df.isnull().any(axis=1)]
    

    Then, whenever you need it, you can type:

    nans(your_dataframe)
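
    As a usage sketch, the helper can be tried on a small frame. The `how` parameter below is a hypothetical extension, not part of the answer above; it switches between "at least one null" and "every value null", and the sample data is made up.

```python
import numpy as np
import pandas as pd

def nans(df, how="any"):
    """Rows with nulls: how="any" means at least one null, how="all" means all null.

    The `how` parameter is a hypothetical extension for illustration.
    """
    mask = df.isnull()
    row_mask = mask.any(axis=1) if how == "any" else mask.all(axis=1)
    return df[row_mask]

# Made-up sample data
your_dataframe = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                               "b": [2.0, 5.0, np.nan]})

print(nans(your_dataframe).index.tolist())
print(nans(your_dataframe, how="all").index.tolist())
```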
    
  • 2020-11-28 01:30

    `df.isna().T.any()` is four characters shorter than `df.isna().any(axis=1)`, but about 2 ms slower in my test:

    %%timeit
    df.isna().T.any()
    # 52.4 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    %%timeit
    df.isna().any(axis=1)
    # 50 ms ± 423 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    I'd probably use axis=1
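
    Outside a notebook, a comparison like this could be reproduced with the standard `timeit` module. This is a rough sketch: the synthetic frame, its size and its NaN rate are assumptions chosen to keep the run short, and the timings will differ from the numbers above.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data (assumed shape and NaN rate, not the frame timed above)
rng = np.random.default_rng(0)
values = rng.random((10_000, 40))
values[rng.random(values.shape) < 0.01] = np.nan
df = pd.DataFrame(values)

# Both expressions should produce the same boolean Series.
via_transpose = df.isna().T.any()
via_axis = df.isna().any(axis=1)
same = via_transpose.equals(via_axis)

# Total seconds for 10 runs of each expression
t_transpose = timeit.timeit(lambda: df.isna().T.any(), number=10)
t_axis = timeit.timeit(lambda: df.isna().any(axis=1), number=10)

print(same, t_transpose, t_axis)
```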
