Find out the percentage of missing values in each column in the given dataset

逝去的感伤 2021-01-31 08:38
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / l         


        
11 answers
  •  轻奢々 (original poster)
    2021-01-31 09:33

    import numpy as np
    import pandas as pd
    
    raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
            'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
            'age': [22, np.nan, 23, 24, 25], 
            'sex': ['m', np.nan, 'f', 'm', 'f'], 
            'Test1_Score': [4, np.nan, 0, 0, 0],
            'Test2_Score': [25, np.nan, np.nan, 0, 0]}
    results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])
    
    
    results 
    
      first_name last_name   age  sex  Test1_Score  Test2_Score
    0      Jason    Miller  22.0    m          4.0         25.0
    1        NaN       NaN   NaN  NaN          NaN          NaN
    2       Tina       NaN  23.0    f          0.0          NaN
    3       Jake    Milner  24.0    m          0.0          0.0
    4        Amy     Cooze  25.0    f          0.0          0.0
    

    You can use the following function, which returns a DataFrame with these columns:

    • Zero Values
    • Missing Values
    • % of Total Values
    • Total Zero Missing Values
    • % Total Zero Missing Values
    • Data Type

    Copy and paste the following function, then call it with your pandas DataFrame:

    def missing_zero_values_table(df):
        # Per-column counts of zeros and of missing (NaN/None) values
        zero_val = (df == 0.00).astype(int).sum(axis=0)
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
        mz_table = mz_table.rename(
            columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
        mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
        mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
        mz_table['Data Type'] = df.dtypes
        # Keep only columns with at least one missing value, sorted by missing percentage
        mz_table = mz_table[
            mz_table.iloc[:, 1] != 0].sort_values(
            '% of Total Values', ascending=False).round(1)
        print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
              "There are " + str(mz_table.shape[0]) +
              " columns that have missing values.")
        # mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
        return mz_table
    
    missing_zero_values_table(results)
    

    Output

    Your selected dataframe has 6 columns and 5 Rows.
    There are 6 columns that have missing values.
    
                 Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
    last_name              0               2               40.0                          2                         40.0    object
    Test2_Score            2               2               40.0                          4                         80.0   float64
    first_name             0               1               20.0                          1                         20.0    object
    age                    0               1               20.0                          1                         20.0   float64
    sex                    0               1               20.0                          1                         20.0    object
    Test1_Score            3               1               20.0                          4                         80.0   float64
    

    If you want to keep it simple, you can use the following function to get the percentage of missing values per column:

    def missing(dff):
        print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
    
    
    missing(results)
    
    Test2_Score    40.0
    last_name      40.0
    Test1_Score    20.0
    sex            20.0
    age            20.0
    first_name     20.0
    dtype: float64
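
A shorter variant of the same idea, offered as a sketch rather than as part of the original answer: since `isna()` returns a boolean frame, taking its column-wise `mean()` gives the fraction of missing values directly, so no division by `len(df)` is needed. The small frame below is hypothetical illustration data, not the asker's dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical illustration data (not the dataset from the question)
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, "x", "y"],
    "c": [1, 2, 3, 4],
})

# isna() yields True/False per cell; the mean of booleans is the fraction of True,
# so multiplying by 100 gives the percentage of missing values in each column.
missing_pct = (df.isna().mean() * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

Here column `b` comes out at 50.0, `a` at 25.0 and `c` at 0.0. Unlike `missing_zero_values_table`, this keeps columns with no missing values in the result.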
    
