How to fill null values in a dataset using Python, based on matching values in two other columns?

心在旅途 2021-01-25 17:12

I have a Titanic dataset. It has several attributes, and I was working mainly on: 1. Age 2. Embark (the port from which passengers embarked; there are 3 ports in total: S, Q and C) 3. Survived ( 0

1 answer
  • 2021-01-25 17:50

    I think you need groupby with transform to get the mean age of each group, then fillna with it:

    titanic['age'] = titanic['age'].fillna(
        titanic.groupby(['survived', 'embarked'])['age'].transform('mean')
    )
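To see the idea on a small scale, here is a minimal sketch on a made-up frame. The name `toy` and all of its values are invented for illustration and are not taken from the Titanic data. It computes each group's mean with `transform` and uses it only where `age` is missing:

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; not the real Titanic values.
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "embarked": ["S", "S", "S", "C", "C", "C"],
    "age": [20.0, 40.0, np.nan, 10.0, np.nan, 30.0],
})

# Mean age of each (survived, embarked) group, aligned to the original rows.
group_mean = toy.groupby(["survived", "embarked"])["age"].transform("mean")

# Only the missing ages are replaced; existing values are left untouched.
toy["age"] = toy["age"].fillna(group_mean)
print(toy)
```

Because `transform` returns a result with the same index as the input, it can be passed straight to `fillna` without any realignment.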
    

    import seaborn as sns
    
    titanic = sns.load_dataset('titanic')
    # check rows where age is NaN
    print(titanic[titanic['age'].isnull()].head(10))
        survived  pclass     sex  age  sibsp  parch      fare embarked   class  \
    5          0       3    male  NaN      0      0    8.4583        Q   Third   
    17         1       2    male  NaN      0      0   13.0000        S  Second   
    19         1       3  female  NaN      0      0    7.2250        C   Third   
    26         0       3    male  NaN      0      0    7.2250        C   Third   
    28         1       3  female  NaN      0      0    7.8792        Q   Third   
    29         0       3    male  NaN      0      0    7.8958        S   Third   
    31         1       1  female  NaN      1      0  146.5208        C   First   
    32         1       3  female  NaN      0      0    7.7500        Q   Third   
    36         1       3    male  NaN      0      0    7.2292        C   Third   
    42         0       3    male  NaN      0      0    7.8958        C   Third   
    
          who  adult_male deck  embark_town alive  alone  
    5     man        True  NaN   Queenstown    no   True  
    17    man        True  NaN  Southampton   yes   True  
    19  woman       False  NaN    Cherbourg   yes   True  
    26    man        True  NaN    Cherbourg    no   True  
    28  woman       False  NaN   Queenstown   yes   True  
    29    man        True  NaN  Southampton    no   True  
    31  woman       False    B    Cherbourg   yes  False  
    32  woman       False  NaN   Queenstown   yes   True  
    36    man        True  NaN    Cherbourg   yes   True  
    42    man        True  NaN    Cherbourg    no   True 
    

    # remember which rows had a missing age so they can be inspected afterwards
    idx = titanic[titanic['age'].isnull()].index
    titanic['age'] = titanic['age'].fillna(
        titanic.groupby(['survived', 'embarked'])['age'].transform('mean')
    )
    
    # check that the values were replaced
    print(titanic.loc[idx].head(10))
        survived  pclass     sex        age  sibsp  parch      fare embarked  \
    5          0       3    male  30.325000      0      0    8.4583        Q   
    17         1       2    male  28.113184      0      0   13.0000        S   
    19         1       3  female  28.973671      0      0    7.2250        C   
    26         0       3    male  33.666667      0      0    7.2250        C   
    28         1       3  female  22.500000      0      0    7.8792        Q   
    29         0       3    male  30.203966      0      0    7.8958        S   
    31         1       1  female  28.973671      1      0  146.5208        C   
    32         1       3  female  22.500000      0      0    7.7500        Q   
    36         1       3    male  28.973671      0      0    7.2292        C   
    42         0       3    male  33.666667      0      0    7.8958        C   
    
         class    who  adult_male deck  embark_town alive  alone  
    5    Third    man        True  NaN   Queenstown    no   True  
    17  Second    man        True  NaN  Southampton   yes   True  
    19   Third  woman       False  NaN    Cherbourg   yes   True  
    26   Third    man        True  NaN    Cherbourg    no   True  
    28   Third  woman       False  NaN   Queenstown   yes   True  
    29   Third    man        True  NaN  Southampton    no   True  
    31   First  woman       False    B    Cherbourg   yes  False  
    32   Third  woman       False  NaN   Queenstown   yes   True  
    36   Third    man        True  NaN    Cherbourg   yes   True  
    42   Third    man        True  NaN    Cherbourg    no   True  
    

    # check the group means
    print(titanic.groupby(['survived', 'embarked'])['age'].mean())
    survived  embarked
    0         C           33.666667
              Q           30.325000
              S           30.203966
    1         C           28.973671
              Q           22.500000
              S           28.113184
    Name: age, dtype: float64
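One thing that may be worth checking afterwards: by default `groupby` skips rows whose grouping key is itself missing, so a row with no `embarked` value has no group mean to borrow. The sketch below, again on invented data, falls back to the overall mean for such rows. The fallback choice and the names `toy`, `group_mean` and `overall_mean` are my assumptions, not part of the approach above, and it assumes a recent pandas (1.5 or later), where `transform` yields NaN for rows with a missing key.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration; the last row has no embarked value.
toy = pd.DataFrame({
    "survived": [0, 0, 1, 1, 1],
    "embarked": ["S", "S", "C", "C", np.nan],
    "age": [26.0, np.nan, 10.0, 30.0, np.nan],
})

# Overall mean of the known ages, used only as a last resort.
overall_mean = toy["age"].mean()

# Group means aligned to the original rows; rows with a missing key get NaN.
group_mean = toy.groupby(["survived", "embarked"])["age"].transform("mean")

# First try the group mean, then fall back to the overall mean.
toy["age"] = toy["age"].fillna(group_mean).fillna(overall_mean)

# Number of ages that are still missing after both fills.
print(toy["age"].isna().sum())
```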
    