How to replace missing values with group mode in Pandas?

小蘑菇 2020-12-20 04:03

I followed the method in this post to replace missing values with the group mode, but I get "IndexError: index out of bounds".

 df['SIC'] = df.gro


        
1 answer
  • 2020-12-20 04:38

    Computing a group mode is tricky because there is no agreed-upon way to break ties, and it's typically very slow. Here's one way that is reasonably fast. We define a function that calculates the mode for each group, then fill the missing values afterwards with a map. Groups with no non-null values simply get no mode, so they cause no errors and stay NaN; on ties, we arbitrarily choose the modal value that comes first when sorted:

    def fast_mode(df, key_cols, value_col):
        """ 
        Calculate a column mode, by group, ignoring null values. 
    
        Parameters
        ----------
        df : pandas.DataFrame
            DataFrame over which to calculate the mode.
        key_cols : list of str
            Columns to groupby for calculation of mode.
        value_col : str
            Column for which to calculate the mode. 
    
        Returns
        -------
        pandas.DataFrame
            One row per key_cols group holding the mode of value_col. On ties,
            returns the value that sorts first.
        """
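        # Count each (group, value) pair; groupby drops rows with a null in any
        # grouping column by default, so null values are ignored. A stable sort
        # keeps the smallest value first among ties, matching the docstring.
        return (df.groupby(key_cols + [value_col]).size()
                  .to_frame('counts').reset_index()
                  .sort_values('counts', ascending=False, kind='mergesort')
                  .drop_duplicates(subset=key_cols)).drop(columns='counts')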
    

    Sample data df:

       CIK  SIK
    0    C  2.0
    1    C  1.0
    2    B  NaN
    3    B  3.0
    4    A  NaN
    5    A  3.0
    6    C  NaN
    7    B  NaN
    8    C  1.0
    9    A  2.0
    10   D  NaN
    11   D  NaN
    12   D  NaN
    

    Code:

    df.loc[df.SIK.isnull(), 'SIK'] = df.CIK.map(fast_mode(df, ['CIK'], 'SIK').set_index('CIK').SIK)
    

    Output df:

       CIK  SIK
    0    C  2.0
    1    C  1.0
    2    B  3.0
    3    B  3.0
    4    A  2.0
    5    A  3.0
    6    C  1.0
    7    B  3.0
    8    C  1.0
    9    A  2.0
    10   D  NaN
    11   D  NaN
    12   D  NaN
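
For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the sample data and applies the fill. Two things in it are my own assumptions rather than part of the answer as posted: the sort uses `kind='mergesort'` so that ties are broken deterministically, and `guarded_mode` is a hypothetical helper (not a pandas function) added only for comparison. It takes the per-group `Series.mode()` and falls back to NaN when the result is empty, which is one plausible place an out-of-bounds index error could come from when a group such as D has no non-null values.

```python
import numpy as np
import pandas as pd


def fast_mode(df, key_cols, value_col):
    """Per-group mode of value_col, ignoring nulls; ties go to the value that sorts first."""
    return (df.groupby(key_cols + [value_col]).size()
              .to_frame('counts').reset_index()
              # Stable sort (an assumption added here) keeps tie order deterministic.
              .sort_values('counts', ascending=False, kind='mergesort')
              .drop_duplicates(subset=key_cols)
              .drop(columns='counts'))


def guarded_mode(s):
    """Hypothetical helper: first mode of a Series, or NaN if every value is null."""
    m = s.mode()  # sorted ascending, nulls dropped; empty for an all-null group
    return m.iloc[0] if not m.empty else np.nan


# Sample data from the answer.
df = pd.DataFrame({
    'CIK': list('CCBBAACBCADDD'),
    'SIK': [2, 1, np.nan, 3, np.nan, 3, np.nan, np.nan, 1, 2,
            np.nan, np.nan, np.nan],
})

# Approach from the answer: build a CIK -> mode lookup, then map it onto the nulls.
lookup = fast_mode(df, ['CIK'], 'SIK').set_index('CIK')['SIK']
filled = df.copy()
filled.loc[filled['SIK'].isnull(), 'SIK'] = filled['CIK'].map(lookup)

# Comparison: slower per-group mode with an explicit guard for empty results.
modes = df.groupby('CIK')['SIK'].agg(guarded_mode)
alt = df['SIK'].fillna(df['CIK'].map(modes))

print(filled)
```

Both routes should fill A with 2.0, B with 3.0, and C with 1.0, and leave D as NaN since it has no observed values to take a mode from.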
    