Find out the percentage of missing values in each column in the given dataset

逝去的感伤 2021-01-31 08:38
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / l         


        
11 Answers
  • 2021-01-31 09:17

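    Update: use mean() with isnull(). The mean of a boolean mask is the fraction of True values, so multiplying by 100 gives the percentage: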

    df.isnull().mean() * 100
    

    Output:

    Ord_id                 0.000000
    Prod_id                0.000000
    Ship_id                0.000000
    Cust_id                0.000000
    Sales                  0.238124
    Discount               0.654840
    Order_Quantity         0.654840
    Profit                 0.654840
    Shipping_Cost          0.654840
    Product_Base_Margin    1.297774
    dtype: float64
    

    IIUC, you can also divide the per-column null counts by the number of rows:

    df.isnull().sum() / df.shape[0] * 100.00
    

    Output:

    Ord_id                 0.000000
    Prod_id                0.000000
    Ship_id                0.000000
    Cust_id                0.000000
    Sales                  0.238124
    Discount               0.654840
    Order_Quantity         0.654840
    Profit                 0.654840
    Shipping_Cost          0.654840
    Product_Base_Margin    1.297774
    dtype: float64
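
Both expressions give the same numbers because the mean of a boolean column is the fraction of True values. A minimal sketch on a made-up frame (the name `toy` and its columns are illustrative, not taken from the dataset in the question):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 rows, with 1, 2 and 0 missing values per column.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, np.nan, 1.0, 2.0],
    "c": ["x", "y", "z", "w"],
})

# Fraction of True values in the null mask, scaled to a percentage.
via_mean = toy.isnull().mean() * 100

# Count of nulls divided by the number of rows, scaled to a percentage.
via_sum = toy.isnull().sum() / toy.shape[0] * 100

print(via_mean)
print(via_sum)
```

The mean() form is shorter; the sum() form is handy when you also want the raw counts.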
    
  • 2021-01-31 09:17

    To round the results to two decimals (isnull() and isna() are aliases in pandas, so one of them is enough to cover all missing values):

    (df.isnull().sum() * 100 / df.index.size).round(2)
    

    The output:

    Ord_id                 0.00
    Prod_id                0.00
    Ship_id                0.00
    Cust_id                0.00
    Sales                  0.24
    Discount               0.65
    Order_Quantity         0.65
    Profit                 0.65
    Shipping_Cost          0.65
    Product_Base_Margin    1.30
    dtype: float64
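
Since isnull() and isna() return the same mask, combining them does not catch anything extra. What neither of them counts is a placeholder such as an empty string. A rough sketch on made-up data; treating "" as missing is an assumption about your data, not something stated in the question:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN, None and an empty string as "missing" markers.
toy = pd.DataFrame({
    "num": [1.0, np.nan, 3.0, None],
    "txt": ["a", "", None, "d"],
})

# isnull() and isna() produce identical boolean masks.
same_mask = toy.isnull().equals(toy.isna())

# NaN and None are counted; the empty string is not.
pct_default = (toy.isna().sum() * 100 / len(toy)).round(2)

# Assumption: empty strings should also count as missing in this data.
cleaned = toy.replace("", np.nan)
pct_cleaned = (cleaned.isna().sum() * 100 / len(cleaned)).round(2)

print(same_mask)
print(pct_default)
print(pct_cleaned)
```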
    
  • 2021-01-31 09:24
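    import numpy as np
    import pandas as pd

    df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')

    # Fill missing Product_Base_Margin values with the column mean
    # (after this step the column reports 0% missing).
    df.loc[np.isnan(df['Product_Base_Margin']), ['Product_Base_Margin']] = df['Product_Base_Margin'].mean()

    # Percentage of missing values per column, rounded to two decimals.
    print(round(100 * (df.isnull().sum() / len(df.index)), 2))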
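
Note that this snippet imputes Product_Base_Margin before measuring, so that column reports 0% afterwards. If you want both numbers, compute the percentages before and after filling. A sketch on made-up values, using fillna as what should be an equivalent way to write the assignment (the names `toy`, `before`, `filled` and `after` are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data reusing two column names from the question.
toy = pd.DataFrame({
    "Sales": [10.0, np.nan, 30.0, 40.0],
    "Product_Base_Margin": [0.5, np.nan, np.nan, 0.7],
})

# Percentages measured on the raw data.
before = round(100 * toy.isnull().sum() / len(toy.index), 2)

# Impute one column with its mean, leaving the original frame untouched.
filled = toy.copy()
filled["Product_Base_Margin"] = filled["Product_Base_Margin"].fillna(
    filled["Product_Base_Margin"].mean()
)

# Percentages measured after imputation.
after = round(100 * filled.isnull().sum() / len(filled.index), 2)

print(before)
print(after)
```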
    
  • 2021-01-31 09:29

    How about this? I think I actually found something similar on here once before, but I'm not seeing it now...

    percent_missing = df.isnull().sum() * 100 / len(df)
    missing_value_df = pd.DataFrame({'column_name': df.columns,
                                     'percent_missing': percent_missing})
    

    And if you want the missing percentages sorted, follow the above with:

    missing_value_df.sort_values('percent_missing', inplace=True)
    

    As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:

    percent_missing = df.isnull().sum() * 100 / len(df)
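
Because percent_missing is already a Series indexed by column name, one possible variation is to turn that index into the column_name column rather than passing df.columns separately. A sketch on made-up data (the frame `toy` is hypothetical; the sort is descending so the worst columns come first):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 rows with 0, 1 and 3 missing values per column.
toy = pd.DataFrame({
    "Ord_id": ["o1", "o2", "o3", "o4", "o5"],
    "Sales": [10.0, np.nan, 30.0, 40.0, 50.0],
    "Profit": [np.nan, np.nan, 3.0, np.nan, 5.0],
})

percent_missing = toy.isnull().sum() * 100 / len(toy)

# Name the values and the index, then move the index into a regular column.
missing_value_df = (
    percent_missing
    .rename("percent_missing")
    .rename_axis("column_name")
    .reset_index()
    .sort_values("percent_missing", ascending=False, ignore_index=True)
)

print(missing_value_df)
```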
    
  • 2021-01-31 09:30

    A single-line solution, sorted from most to least missing:

    df.isnull().mean().round(4).mul(100).sort_values(ascending=False)
    
  • 2021-01-31 09:31

    If you have multiple DataFrames, the function below calculates the number and percentage of missing values in each column:

    def miss_data(df):
        x = ['column_name','missing_data', 'missing_in_percentage']
        missing_data = pd.DataFrame(columns=x)
        columns = df.columns
        for col in columns:
            icolumn_name = col
            imissing_data = df[col].isnull().sum()
            imissing_in_percentage = (df[col].isnull().sum()/df[col].shape[0])*100
    
            missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
        print(missing_data) 
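
If you would rather get the summary back than print it, a vectorized variant along these lines may be easier to reuse across several frames. This is only a sketch: `miss_data_summary` and the `frames` dictionary are hypothetical names, and the data is made up.

```python
import numpy as np
import pandas as pd

def miss_data_summary(df):
    """Return one row per column with the count and percentage of missing values."""
    missing = df.isnull().sum()
    return pd.DataFrame({
        "column_name": df.columns,
        "missing_data": missing.to_numpy(),
        "missing_in_percentage": (missing / len(df) * 100).to_numpy(),
    })

# Hypothetical frames standing in for "multiple DataFrames".
frames = {
    "orders": pd.DataFrame({"id": [1, 2, 3, 4], "qty": [1.0, np.nan, 3.0, 4.0]}),
    "customers": pd.DataFrame({"id": [1, 2], "name": ["a", None]}),
}

# One summary per frame, keyed by the frame's label.
summaries = {name: miss_data_summary(frame) for name, frame in frames.items()}

for name, summary in summaries.items():
    print(name)
    print(summary)
```

Returning the DataFrame lets the caller sort, filter or concatenate the summaries, which printing inside the function does not allow.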
    