How to replace missing values with group mode in Pandas?

我们两清 提交于 2019-12-06 16:49:27

mode is quite difficult, given that there really isn't any agreed upon way to deal with ties. Plus it's typically very slow. Here's one way that will be "fast". We'll define a function that calculates the mode for each group, then we can fill the missing values afterwards with a map. We don't run into issues with missing groups, though for ties we arbitrarily choose the modal value that comes first when sorted:

def fast_mode(df, key_cols, value_col):
    """ 
    Calculate a column mode, by group, ignoring null values. 

    Parameters
    ----------
    df : pandas.DataFrame
        DataFrame over which to calcualate the mode. 
    key_cols : list of str
        Columns to groupby for calculation of mode.
    value_col : str
        Column for which to calculate the mode. 

    Return
    ------ 
    pandas.DataFrame
        One row for the mode of value_col per key_cols group. If ties, 
        returns the one which is sorted first. 
    """
    return (df.groupby(key_cols + [value_col]).size() 
              .to_frame('counts').reset_index() 
              .sort_values('counts', ascending=False) 
              .drop_duplicates(subset=key_cols)).drop(columns='counts')

Sample data df:

   CIK  SIK
0    C  2.0
1    C  1.0
2    B  NaN
3    B  3.0
4    A  NaN
5    A  3.0
6    C  NaN
7    B  NaN
8    C  1.0
9    A  2.0
10   D  NaN
11   D  NaN
12   D  NaN

Code:

df.loc[df.SIK.isnull(), 'SIK'] = df.CIK.map(fast_mode(df, ['CIK'], 'SIK').set_index('CIK').SIK)

Output df:

   CIK  SIK
0    C  2.0
1    C  1.0
2    B  3.0
3    B  3.0
4    A  2.0
5    A  3.0
6    C  1.0
7    B  3.0
8    C  1.0
9    A  2.0
10   D  NaN
11   D  NaN
12   D  NaN
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!