How to drop rows of Pandas DataFrame whose value in a certain column is NaN

一生所求 2020-11-22 00:59

I have this DataFrame and want only the records whose EPS column is not NaN:

>>> df
                 STK_ID           


        
12 Answers
  • 2020-11-22 01:22

    In datasets with a large number of columns, it helps to first check how many columns contain null values and how many don't.

    print("No. of columns containing null values")
    print(len(df.columns[df.isna().any()]))
    
    print("No. of columns not containing null values")
    print(len(df.columns[df.notna().all()]))
    
    print("Total no. of columns in the dataframe")
    print(len(df.columns))
    

    For example, my dataframe had 82 columns, of which 19 contained at least one null value.

    You can also drop columns and rows automatically. The code below first drops every column whose null count exceeds the number of columns in the dataframe, then drops every remaining row that contains a null value:

    df = df.drop(df.columns[df.isna().sum()>len(df.columns)],axis = 1)
    df = df.dropna(axis = 0).reset_index(drop=True)
    

    Note: the code above removes all of your null values. If you need any of them, process them before running it.
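
The counting idea above can be tried on a small frame. This is only a sketch: the toy data and the column names a, b and c are made up for illustration and are not from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names are invented for this example.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, 2.0, 3.0],
    "c": [np.nan, np.nan, np.nan],
})

# Number of NaN values in each column.
null_counts = df.isna().sum()

# Columns with at least one NaN, and columns with none.
cols_with_nulls = df.columns[df.isna().any()].tolist()
cols_without_nulls = df.columns[df.notna().all()].tolist()

print(null_counts)
print(cols_with_nulls, cols_without_nulls)
```

Looking at the per-column counts first makes it easier to decide whether dropping rows or dropping columns loses less data.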

  • 2020-11-22 01:23

    A simple and easy way:

    df.dropna(subset=['EPS'],inplace=True)

    source: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
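
A minimal, self-contained sketch of the same call. Only the column names EPS and cash come from the question; the values are invented. It uses assignment rather than inplace=True so that the original frame can still be inspected afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only the EPS and cash column names come from the question.
df = pd.DataFrame({
    "EPS": [np.nan, 4.3, np.nan, 2.5],
    "cash": [12.0, np.nan, np.nan, np.nan],
})

# Keep rows where EPS is present; NaN in other columns is ignored.
cleaned = df.dropna(subset=["EPS"])
print(cleaned)
```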

  • 2020-11-22 01:30

    This question is already resolved, but...

    ...also consider the solution suggested by Wouter in his original comment. The ability to handle missing data, including dropna(), is built into pandas explicitly. Aside from potentially improved performance over doing it manually, these functions also come with a variety of options which may be useful.

    In [24]: df = pd.DataFrame(np.random.randn(10,3))
    
    In [25]: df.iloc[::2,0] = np.nan; df.iloc[::4,1] = np.nan; df.iloc[::3,2] = np.nan;
    
    In [26]: df
    Out[26]:
              0         1         2
    0       NaN       NaN       NaN
    1  2.677677 -1.466923 -0.750366
    2       NaN  0.798002 -0.906038
    3  0.672201  0.964789       NaN
    4       NaN       NaN  0.050742
    5 -1.250970  0.030561 -2.678622
    6       NaN  1.036043       NaN
    7  0.049896 -0.308003  0.823295
    8       NaN       NaN  0.637482
    9 -0.310130  0.078891       NaN
    

    In [27]: df.dropna()     #drop all rows that have any NaN values
    Out[27]:
              0         1         2
    1  2.677677 -1.466923 -0.750366
    5 -1.250970  0.030561 -2.678622
    7  0.049896 -0.308003  0.823295
    

    In [28]: df.dropna(how='all')     #drop only if ALL columns are NaN
    Out[28]:
              0         1         2
    1  2.677677 -1.466923 -0.750366
    2       NaN  0.798002 -0.906038
    3  0.672201  0.964789       NaN
    4       NaN       NaN  0.050742
    5 -1.250970  0.030561 -2.678622
    6       NaN  1.036043       NaN
    7  0.049896 -0.308003  0.823295
    8       NaN       NaN  0.637482
    9 -0.310130  0.078891       NaN
    

    In [29]: df.dropna(thresh=2)   #Drop row if it does not have at least two values that are **not** NaN
    Out[29]:
              0         1         2
    1  2.677677 -1.466923 -0.750366
    2       NaN  0.798002 -0.906038
    3  0.672201  0.964789       NaN
    5 -1.250970  0.030561 -2.678622
    7  0.049896 -0.308003  0.823295
    9 -0.310130  0.078891       NaN
    

    In [30]: df.dropna(subset=[1])   #Drop only if NaN in specific column (as asked in the question)
    Out[30]:
              0         1         2
    1  2.677677 -1.466923 -0.750366
    2       NaN  0.798002 -0.906038
    3  0.672201  0.964789       NaN
    5 -1.250970  0.030561 -2.678622
    6       NaN  1.036043       NaN
    7  0.049896 -0.308003  0.823295
    9 -0.310130  0.078891       NaN
    

    There are also other options (See docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html), including dropping columns instead of rows.

    Pretty handy!

  • 2020-11-22 01:33

    Yet another solution, which uses the fact that np.nan != np.nan:

    In [149]: df.query("EPS == EPS")
    Out[149]:
                     STK_ID  EPS  cash
    STK_ID RPT_Date
    600016 20111231  600016  4.3   NaN
    601939 20111231  601939  2.5   NaN
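
To see why the self-comparison works, here is a small sketch on made-up EPS values. It checks that the query form selects the same rows as a notna mask on a plain float column.

```python
import numpy as np
import pandas as pd

# Hypothetical values; only the EPS column name comes from the question.
df = pd.DataFrame({"EPS": [np.nan, 4.3, 2.5]})

# NaN never compares equal to itself, so "EPS == EPS" is False for NaN rows.
nan_equals_itself = (np.nan == np.nan)

by_query = df.query("EPS == EPS")
by_mask = df[df["EPS"].notna()]

print(nan_equals_itself)
print(by_query.equals(by_mask))
```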
    
  • 2020-11-22 01:35

    You could use the notnull method, the inverse of isnull, or numpy.isnan:

    In [332]: df[df.EPS.notnull()]
    Out[332]:
       STK_ID  RPT_Date  STK_ID.1  EPS  cash
    2  600016  20111231    600016  4.3   NaN
    4  601939  20111231    601939  2.5   NaN
    
    
    In [334]: df[~df.EPS.isnull()]
    Out[334]:
       STK_ID  RPT_Date  STK_ID.1  EPS  cash
    2  600016  20111231    600016  4.3   NaN
    4  601939  20111231    601939  2.5   NaN
    
    
    In [347]: df[~np.isnan(df.EPS)]
    Out[347]:
       STK_ID  RPT_Date  STK_ID.1  EPS  cash
    2  600016  20111231    600016  4.3   NaN
    4  601939  20111231    601939  2.5   NaN
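
One practical difference between these options, sketched on a hypothetical object-dtype column: notna handles None and NaN in any column, while np.isnan is only defined for numeric data and raises a TypeError on object data.

```python
import numpy as np
import pandas as pd

# Hypothetical object-dtype column mixing a string with missing values.
s = pd.Series(["a", None, np.nan], dtype=object)

# notna treats both None and NaN as missing.
present = s.notna().tolist()

# np.isnan has no implementation for object data, so it raises here.
try:
    np.isnan(s)
    isnan_worked = True
except TypeError:
    isnan_worked = False

print(present, isnan_worked)
```

For a float column such as EPS in the question, all three forms select the same rows.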
    
  • 2020-11-22 01:36

    How to drop rows of Pandas DataFrame whose value in a certain column is NaN

    This is an old question which has been beaten to death but I do believe there is some more useful information to be surfaced on this thread. Read on if you're looking for the answer to any of the following questions:

    • Can I drop rows if any of its values have NaNs? What about if all of them are NaN?
    • Can I only look at NaNs in specific columns when dropping rows?
    • Can I drop rows with a specific count of NaN values?
    • How do I drop columns instead of rows?
    • I tried all of the options above but my DataFrame just won't update!

    DataFrame.dropna: Usage, and Examples

    It's already been said that df.dropna is the canonical method to drop NaNs from DataFrames, but there's nothing like a few visual cues to help along the way.

    # Setup
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'A': [np.nan, 2, 3, 4],
        'B': [np.nan, np.nan, 2, 3],
        'C': [np.nan]*3 + [3]})
    
    df                      
         A    B    C
    0  NaN  NaN  NaN
    1  2.0  NaN  NaN
    2  3.0  2.0  NaN
    3  4.0  3.0  3.0
    

    Below is a breakdown of the most important arguments and how they work, arranged in an FAQ format.


    Can I drop rows if any of its values have NaNs? What about if all of them are NaN?

    This is where the how=... argument comes in handy. It can be one of

    • 'any' (default) - drops rows if at least one column has NaN
    • 'all' - drops rows only if all of its columns have NaNs


    # Removes all but the last row, since it is the only row with no NaNs
    df.dropna()
    
         A    B    C
    3  4.0  3.0  3.0
    
    # Removes the first row only
    df.dropna(how='all')
    
         A    B    C
    1  2.0  NaN  NaN
    2  3.0  2.0  NaN
    3  4.0  3.0  3.0
    

    Note
    If you just want to see which rows contain nulls (in other words, if you want a boolean mask of rows), use isna:

    df.isna()
    
           A      B      C
    0   True   True   True
    1  False   True   True
    2  False  False   True
    3  False  False  False
    
    df.isna().any(axis=1)
    
    0     True
    1     True
    2     True
    3    False
    dtype: bool
    

    To get the inversion of this result, use notna instead.
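
As a rough illustration of the mask approach, the sketch below rebuilds the setup frame and uses notna to select the complete rows by hand. It should match what dropna() returns with its defaults.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

# True for rows in which every column is non-null.
complete_mask = df.notna().all(axis=1)

# Boolean indexing with the mask keeps only the complete rows.
complete_rows = df[complete_mask]
print(complete_rows)
```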


    Can I only look at NaNs in specific columns when dropping rows?

    This is a use case for the subset=[...] argument.

    Specify a list of columns (or index labels with axis=1) to tell pandas you only want to look at these columns (or rows with axis=1) when dropping rows (or columns with axis=1).

    # Drop all rows with NaNs in A
    df.dropna(subset=['A'])
    
         A    B    C
    1  2.0  NaN  NaN
    2  3.0  2.0  NaN
    3  4.0  3.0  3.0
    
    # Drop all rows with NaNs in A OR B
    df.dropna(subset=['A', 'B'])
    
         A    B    C
    2  3.0  2.0  NaN
    3  4.0  3.0  3.0
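
Two variations on subset, sketched on the same setup frame: combining it with how='all', and using it with axis=1, where the subset entries are row labels rather than column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

# Drop a row only when BOTH A and B are NaN (row 0 here).
both_missing = df.dropna(subset=["A", "B"], how="all")

# With axis=1, subset lists row labels: drop columns that have NaN in row 2.
cols_ok_in_row_2 = df.dropna(axis=1, subset=[2])

print(both_missing)
print(cols_ok_in_row_2)
```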
    

    Can I drop rows with a specific count of NaN values?

    This is a use case for the thresh=... argument. Specify the minimum number of NON-NULL values as an integer.

    df.dropna(thresh=1)  
    
         A    B    C
    1  2.0  NaN  NaN
    2  3.0  2.0  NaN
    3  4.0  3.0  3.0
    
    df.dropna(thresh=2)
    
         A    B    C
    2  3.0  2.0  NaN
    3  4.0  3.0  3.0
    
    df.dropna(thresh=3)
    
         A    B    C
    3  4.0  3.0  3.0
    

    The thing to note here is you need to specify how many NON-NULL values you want to keep, rather than how many NULL values you want to drop. This is a pain point for new users.

    Luckily the fix is easy: if you have a minimum count of NULL values, subtract it from the number of columns and add 1 to get the correct thresh argument for the function.

    required_min_null_values_to_drop = 2 # drop rows with at least 2 NaN
    df.dropna(thresh=df.shape[1] - required_min_null_values_to_drop + 1)
    
         A    B    C
    2  3.0  2.0  NaN
    3  4.0  3.0  3.0
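
If the thresh arithmetic feels error-prone, an equivalent way to express the same rule is to count NaNs per row and filter with a boolean mask. This is a sketch of that alternative, not part of the dropna API; the variable n is a name chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

n = 2  # drop rows that have at least n NaN values

# thresh counts NON-NULL values, hence the subtraction and the +1.
by_thresh = df.dropna(thresh=df.shape[1] - n + 1)

# Same rule stated directly in terms of NaN counts per row.
by_mask = df[df.isna().sum(axis=1) < n]

print(by_thresh.equals(by_mask))
```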
    

    How do I drop columns instead of rows?

    Use the axis=... argument, which can be axis=0 or axis=1.

    It tells the function whether you want to drop rows (axis=0) or drop columns (axis=1).

    df.dropna()
    
         A    B    C
    3  4.0  3.0  3.0
    
    # Every column contains at least one NaN, so the result is empty.
    df.dropna(axis=1)
    
    Empty DataFrame
    Columns: []
    Index: [0, 1, 2, 3]
    
    # Here's a different example requiring the column to have all NaN rows
    # to be dropped. In this case no columns satisfy the condition.
    df.dropna(axis=1, how='all')
    
         A    B    C
    0  NaN  NaN  NaN
    1  2.0  NaN  NaN
    2  3.0  2.0  NaN
    3  4.0  3.0  3.0
    
    # Here's a different example requiring a column to have at least 2 NON-NULL
    # values. Column C has fewer than 2 NON-NULL values, so it is dropped.
    df.dropna(axis=1, thresh=2)
    
         A    B
    0  NaN  NaN
    1  2.0  NaN
    2  3.0  2.0
    3  4.0  3.0
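
When the cutoff is easier to state as a fraction of rows than as a count, the same column filter can be written with a mask. A sketch, assuming a hypothetical cutoff of 50% non-null values (min_fraction is a name invented here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

min_fraction = 0.5  # hypothetical cutoff: keep columns at least half filled

# notna().mean() gives the fraction of non-null values in each column.
filled_fraction = df.notna().mean()
kept = df.loc[:, filled_fraction >= min_fraction]

print(kept.columns.tolist())
```

On this four-row frame, the result matches df.dropna(axis=1, thresh=2).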
    

    I tried all of the options above but my DataFrame just won't update!

    dropna, like most other functions in the pandas API, returns a new DataFrame (a copy of the original with changes) as the result, so you should assign it back if you want to see changes.

    df.dropna(...) # wrong
    df.dropna(..., inplace=True) # right, but not recommended
    df = df.dropna(...) # right
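
The difference is easy to verify on the setup frame. In this sketch, the plain call leaves df untouched, while inplace=True modifies df and returns None, which is a common source of bugs when the return value is assigned.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [np.nan, 2, 3, 4],
    "B": [np.nan, np.nan, 2, 3],
    "C": [np.nan] * 3 + [3],
})

# Returns a new frame; df itself still has all 4 rows.
result = df.dropna()
rows_before = len(df)

# Modifies df in place and returns None.
returned = df.dropna(inplace=True)
rows_after = len(df)

print(rows_before, len(result), returned, rows_after)
```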
    

    Reference

    https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

    DataFrame.dropna(
        self, axis=0, how='any', thresh=None, subset=None, inplace=False)
    
