Pandas: filling missing values by mean in each group

耶瑟儿~ 2020-11-22 06:06

This should be straightforward, but the closest thing I've found is this post: pandas: Filling missing values within a group, and I still can't solve my problem....

9 answers
  • 2020-11-22 06:39

    fillna + groupby + transform + mean

    This seems intuitive:

    df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
    

    The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.
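
To see why this works, here is a minimal sketch on a small made-up frame. The sample data and the intermediate name `group_means` are illustrative assumptions, not part of the answer. `transform('mean')` returns one value per row, aligned to the original index, so `fillna` can use it row by row:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: three groups, each with one missing value.
df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

# One mean per row, aligned to df's index (mean skips NaN by default).
group_means = df.groupby('name')['value'].transform('mean')

# Only the missing entries are replaced; existing values stay untouched.
df['value'] = df['value'].fillna(group_means)
print(df)
```

Keeping the group means in a separate variable is optional; it only makes the intermediate step visible.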

  • 2020-11-22 06:40

    @DSM has, IMO, the right answer, but I'd like to share a generalization and optimization for the case where you group by multiple columns and have multiple value columns:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(
        {
            'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
            'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
            'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
            'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
        }
    )
    

    ... gives ...

      category name  other_value value
    0        X    A         10.0   1.0
    1        X    A          NaN   NaN
    2        X    B          NaN   NaN
    3        X    B         20.0   2.0
    4        X    B         30.0   3.0
    5        X    B         10.0   1.0
    6        Y    C         30.0   3.0
    7        Y    C          NaN   NaN
    8        Y    C         30.0   3.0
    

    In this generalized case we would like to group by category and name, and impute only on value.

    This can be solved as follows:

    df['value'] = df.groupby(['category', 'name'])['value']\
        .transform(lambda x: x.fillna(x.mean()))
    

    Notice the column list in the group-by clause, and that we select the value column right after the group-by. This way the transformation runs only on that column. You could instead select the column at the end, but then the transformation runs on all columns, only to discard all but one measure column afterwards. A standard SQL query planner might be able to optimize this, but pandas (0.19.2) doesn't seem to.
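
As a quick check that the two orderings differ only in cost and not in result, here is a small sketch on the sample frame from this answer. The names `fill_with_group_mean`, `keys`, `select_first`, and `select_last` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

def fill_with_group_mean(x):
    """Replace missing entries with the mean of the group."""
    return x.fillna(x.mean())

keys = ['category', 'name']

# Select first: the function only ever sees the 'value' column.
select_first = df.groupby(keys)['value'].transform(fill_with_group_mean)

# Select last: every non-key column is transformed, then one is kept.
select_last = df.groupby(keys).transform(fill_with_group_mean)['value']

print(select_first.equals(select_last))
```

Both variants should produce the same Series; the first simply avoids the work done on `other_value`.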

    A performance test, after enlarging the dataset with ...

    # Stack 10000 copies of the sample frame in a single concat call
    df = pd.concat([df] * 10000)
    

    ... confirms that the speed-up is proportional to the number of columns you don't have to impute:

    import pandas as pd
    from datetime import datetime
    
    def generate_data():
        ...
    
    t = datetime.now()
    df = generate_data()
    df['value'] = df.groupby(['category', 'name'])['value']\
        .transform(lambda x: x.fillna(x.mean()))
    print(datetime.now()-t)
    
    # 0:00:00.016012
    
    t = datetime.now()
    df = generate_data()
    df["value"] = df.groupby(['category', 'name'])\
        .transform(lambda x: x.fillna(x.mean()))['value']
    print(datetime.now()-t)
    
    # 0:00:00.030022
    

    On a final note you can generalize even further if you want to impute more than one column, but not all:

    df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
        .transform(lambda x: x.fillna(x.mean()))
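
A runnable sketch of the multi-column case on the sample frame from this answer. The list name `cols` is an illustrative assumption. Note that the columns are selected with a list; recent pandas versions reject a bare tuple of column names after `groupby`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

# Columns to impute; any other non-key columns would be left alone.
cols = ['value', 'other_value']

# Each selected column is filled with its own per-group mean.
df[cols] = df.groupby(['category', 'name'])[cols].transform(
    lambda x: x.fillna(x.mean())
)
print(df)
```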
    
  • 2020-11-22 06:45

    The featured top-ranked answer only works for a pandas DataFrame with exactly two columns. If you have more columns, use this instead:

    df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
        lambda x: x.fillna(x.mean()))
    
  • 2020-11-22 06:51

    Most of the answers above use "groupby" and "transform" to fill the missing values.

    I prefer "groupby" with "apply", which is more intuitive to me.

    >>> # group_keys=False keeps the original index so the result can be assigned back
    >>> df['value'] = df.groupby('name', group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
    >>> df.isnull().sum().sum()
    0
    

    Shortcut: Groupby + Apply/Lambda + Fillna + Mean
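
A small sketch comparing this apply-based approach with `transform`, assuming pandas 2.0 or later, where `apply` prepends the group keys to the result index unless `group_keys=False` is passed. The sample data and the names `fill_with_group_mean`, `via_apply`, and `via_transform` are illustrative assumptions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
})

def fill_with_group_mean(x):
    """Replace missing entries with the mean of the group."""
    return x.fillna(x.mean())

# apply: group_keys=False keeps the original row index on the result.
via_apply = df.groupby('name', group_keys=False)['value'].apply(fill_with_group_mean)

# transform: always returns a result indexed like the input.
via_transform = df.groupby('name')['value'].transform(fill_with_group_mean)

# Both results line up with df, so either can be assigned back to df['value'].
print(via_apply.index.equals(df.index))
print(via_apply.equals(via_transform))
```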

    This solution still works if you want to group by multiple columns to replace missing values.

    >>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan, np.nan, 4, 3],
    ...                    'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
    ...                    'class': list('ppqqrrsss')})
    >>> df
       value name class
    0    1.0    A     p
    1    NaN    A     p
    2    NaN    B     q
    3    2.0    B     q
    4    3.0    B     r
    5    NaN    B     r
    6    NaN    C     s
    7    4.0    C     s
    8    3.0    C     s
    >>> df['value'] = df.groupby(['name', 'class'], group_keys=False)['value'].apply(lambda x: x.fillna(x.mean()))
    >>> df
       value name class
    0    1.0    A     p
    1    1.0    A     p
    2    2.0    B     q
    3    2.0    B     q
    4    3.0    B     r
    5    3.0    B     r
    6    3.5    C     s
    7    4.0    C     s
    8    3.0    C     s
    
  • One way would be to use transform:

    >>> df
      name  value
    0    A      1
    1    A    NaN
    2    B    NaN
    3    B      2
    4    B      3
    5    B      1
    6    C      3
    7    C    NaN
    8    C      3
    >>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
    >>> df
      name  value
    0    A      1
    1    A      1
    2    B      2
    3    B      2
    4    B      3
    5    B      1
    6    C      3
    7    C      3
    8    C      3
    
  • 2020-11-22 06:52
    def group_mean_value(values):
        # values is the 'value' Series of a single group
        return values.fillna(values.mean())

    dft = df.copy()
    dft['value'] = df.groupby("name")['value'].transform(group_mean_value)
    