Find out the percentage of missing values in each column in the given dataset

逝去的感伤 2021-01-31 08:38
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / l         


        
11 Answers
  • 2021-01-31 09:31

    With the following code, you can get the percentage of missing values for every column. Just replace the name train_data with df in your case.

    Input:

    In [1]:
    
    all_data_na = (train_data.isnull().sum() / len(train_data)) * 100
    all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
    missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
    missing_data.head(20)
    

    Output :

    Out[1]: 
                                    Missing Ratio
     left_eyebrow_outer_end_x       68.435239
     left_eyebrow_outer_end_y       68.435239
     right_eyebrow_outer_end_y      68.279189
     right_eyebrow_outer_end_x      68.279189
     left_eye_outer_corner_x        67.839410
     left_eye_outer_corner_y        67.839410
     right_eye_inner_corner_x       67.825223
     right_eye_inner_corner_y       67.825223
     right_eye_outer_corner_x       67.825223
     right_eye_outer_corner_y       67.825223
     mouth_left_corner_y            67.811037
     mouth_left_corner_x            67.811037
     left_eyebrow_inner_end_x       67.796851
     left_eyebrow_inner_end_y       67.796851
     right_eyebrow_inner_end_y      67.796851
     mouth_right_corner_x           67.796851
     mouth_right_corner_y           67.796851
     right_eyebrow_inner_end_x      67.796851
     left_eye_inner_corner_x        67.782664
     left_eye_inner_corner_y        67.782664
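
The same steps can be tried on a small frame. This is a minimal sketch, assuming a made-up DataFrame `df` with columns `a`, `b`, and `c` (none of which come from the original dataset); it computes the per-column percentage, drops complete columns, and sorts the rest in descending order.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for train_data / df.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# Percentage of missing values per column.
missing_pct = df.isnull().sum() / len(df) * 100

# Keep only columns that actually have missing values, largest first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

missing_data = pd.DataFrame({"Missing Ratio": missing_pct})
print(missing_data)
```

Column `b` has no missing values, so it does not appear in the result.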
    
  • 2021-01-31 09:32

    I did it like this:

    import pandas as pd

    def missing_percent(df):
        # Total missing values per column
        mis_val = df.isnull().sum()

        # Percentage of missing values per column
        mis_percent = 100 * mis_val / len(df)

        # Make a table with the results
        mis_table = pd.concat([mis_val, mis_percent], axis=1)

        # Rename the columns
        mis_columns = mis_table.rename(
            columns={0: 'Missing Values', 1: 'Percent of Total Values'})

        # Keep only columns with missing values, sorted by percentage descending
        mis_columns = mis_columns[
            mis_columns.iloc[:, 1] != 0].sort_values(
            'Percent of Total Values', ascending=False).round(2)

        # Print some summary information
        print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
              "There are " + str(mis_columns.shape[0]) +
              " columns that have missing values.")

        # Return the dataframe with missing information
        return mis_columns
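
To see what a table like this looks like, here is a compact, self-contained sketch of the same idea. The helper name `summarize_missing` and the sample frame are made up for illustration; it builds the two columns directly from a dict instead of using `pd.concat` and `rename`, which is a design choice of this sketch rather than the method above.

```python
import numpy as np
import pandas as pd

def summarize_missing(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "Missing Values": counts,
        "Percent of Total Values": (100 * counts / len(df)).round(2),
    })
    # Drop complete columns and sort by percentage, descending.
    summary = summary[summary["Missing Values"] > 0]
    return summary.sort_values("Percent of Total Values", ascending=False)

# Hypothetical sample data: "x" is complete, "y" and "z" have gaps.
sample = pd.DataFrame({
    "x": [1, 2, 3],
    "y": [np.nan, "b", "c"],
    "z": [np.nan, np.nan, 1.5],
})
result = summarize_missing(sample)
print(result)
```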
    
  • 2021-01-31 09:33
    import numpy as np
    import pandas as pd
    
    raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
            'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
            'age': [22, np.nan, 23, 24, 25], 
            'sex': ['m', np.nan, 'f', 'm', 'f'], 
            'Test1_Score': [4, np.nan, 0, 0, 0],
            'Test2_Score': [25, np.nan, np.nan, 0, 0]}
    results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])
    
    
    results 
    
      first_name last_name   age  sex  Test1_Score  Test2_Score
    0      Jason    Miller  22.0    m          4.0         25.0
    1        NaN       NaN   NaN  NaN          NaN          NaN
    2       Tina       NaN  23.0    f          0.0          NaN
    3       Jake    Milner  24.0    m          0.0          0.0
    4        Amy     Cooze  25.0    f          0.0          0.0
    

    You can use the following function, which returns a DataFrame with these columns:

    • Zero Values
    • Missing Values
    • % of Total Values
    • Total Zero Missing Values
    • % Total Zero Missing Values
    • Data Type

    Just copy and paste the following function and call it with your pandas DataFrame:

    def missing_zero_values_table(df):
        # Count zeros and missing values per column
        zero_val = (df == 0.00).astype(int).sum(axis=0)
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * mis_val / len(df)
        mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
        mz_table = mz_table.rename(
            columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
        mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
        mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
        mz_table['Data Type'] = df.dtypes
        # Keep only columns with missing values, sorted by missing percentage
        mz_table = mz_table[
            mz_table.iloc[:, 1] != 0].sort_values(
            '% of Total Values', ascending=False).round(1)
        print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
              "There are " + str(mz_table.shape[0]) +
              " columns that have missing values.")
        # Optional export to Excel:
        # mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
        return mz_table
    
    missing_zero_values_table(results)
    

    Output

    Your selected dataframe has 6 columns and 5 Rows.
    There are 6 columns that have missing values.
    
                 Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
    last_name              0               2               40.0                          2                         40.0    object
    Test2_Score            2               2               40.0                          4                         80.0   float64
    first_name             0               1               20.0                          1                         20.0    object
    age                    0               1               20.0                          1                         20.0   float64
    sex                    0               1               20.0                          1                         20.0    object
    Test1_Score            3               1               20.0                          4                         80.0   float64
    

    If you want to keep it simple, you can use the following function to get the percentage of missing values per column:

    def missing(dff):
        print(round(dff.isnull().sum() * 100 / len(dff), 2).sort_values(ascending=False))
    
    
    missing(results)
    
    Test2_Score    40.0
    last_name      40.0
    Test1_Score    20.0
    sex            20.0
    age            20.0
    first_name     20.0
    dtype: float64
    
  • 2021-01-31 09:35

    The solution you're looking for is:

    round(df.isnull().mean()*100,2) 
    

    This rounds the percentages to 2 decimal places.

    Another way to do this is:

    round((df.isnull().sum()*100)/len(df),2)
    

    but this is not as efficient as using mean().
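
Both expressions give the same numbers, because the mean of a boolean column is the fraction of True values. The following is a small sketch on a made-up DataFrame (the column names and values are illustrative assumptions) that computes the percentages both ways so they can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; "city" has no missing values.
df = pd.DataFrame({
    "first_name": ["Jason", np.nan, "Tina", "Jake", "Amy"],
    "age": [22, np.nan, 23, 24, 25],
    "city": ["A", "B", "C", "D", "E"],
})

# Mean of the boolean mask = fraction of missing values per column.
via_mean = round(df.isnull().mean() * 100, 2)

# Equivalent: count of missing values divided by the number of rows.
via_sum = round(df.isnull().sum() * 100 / len(df), 2)

print(via_mean)
print(via_sum)
```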

  • 2021-01-31 09:42

    Let's break down your request:

    1. You want the percentage of missing values
    2. It should be sorted in ascending order, with the values rounded to 2 decimal places

    Explanation:

    1. dhr[fill_cols].isnull().sum() - gives the total number of missing values per column (here dhr is the DataFrame and fill_cols is the list of columns to check)
    2. dhr.shape[0] - gives the total number of rows
    3. (100 * dhr[fill_cols].isnull().sum() / dhr.shape[0]) - gives a Series with the percentages as values and the column names as index; without the factor of 100 the values are fractions between 0 and 1
    4. Since the output is a Series, you can round and sort based on the values

    code:

    (100 * dhr[fill_cols].isnull().sum() / dhr.shape[0]).round(2).sort_values()
    

    Reference: sort, round
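
As a runnable illustration of the steps in this answer, here is a minimal sketch. The contents of `dhr` and `fill_cols` are invented for the example (the answer does not show them), and the ratio is multiplied by 100 so that the result is a percentage rather than a fraction.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the answer's dhr and fill_cols.
dhr = pd.DataFrame({
    "height": [1.7, np.nan, 1.8, 1.6],
    "weight": [np.nan, np.nan, np.nan, 60.0],
    "name": ["a", "b", "c", "d"],
})
fill_cols = ["height", "weight"]

# Missing count per selected column, divided by the row count, as a percentage,
# rounded to 2 decimal places and sorted in ascending order.
missing_pct = (100 * dhr[fill_cols].isnull().sum() / dhr.shape[0]).round(2).sort_values()
print(missing_pct)
```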
