GroupBy pandas DataFrame and select most common value

梦谈多话 2020-11-22 07:59

I have a data frame with three string columns. I know that only one value in the third column is valid for each combination of the first two. To clean the data I have to

10 answers
  •  栀梦 (original poster)
    2020-11-22 08:46

    A little late to the game here, but I was running into some performance issues with HYRY's solution, so I had to come up with another one.

    It works by counting how often each key-value pair occurs and then, for each key, keeping only the value that appears with it most often.
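
    As a rough illustration of those two steps, here is a minimal sketch on a small made-up frame. The data and the names `counts` and `top` are hypothetical, and `kind='mergesort'` is an addition used here only so that the ordering of tied rows is stable:

```python
import pandas as pd

# Hypothetical toy data, not taken from the scale test.
df = pd.DataFrame({'key':   [1, 1, 1, 2, 2],
                   'value': ['A', 'B', 'B', 'C', 'C']})

# Step 1: frequency of each (key, value) pair.
counts = df.groupby(['key', 'value']).size().to_frame('count').reset_index()

# Step 2: sort by frequency, then keep the first (most frequent) row per key.
# mergesort is a stable sort, so tied rows keep their relative order.
top = (counts.sort_values('count', ascending=False, kind='mergesort')
             .drop_duplicates(subset=['key']))

print(counts)
print(top)
```

    Here `counts` has one row per distinct pair, and `top` keeps a single row for each key.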

    There's also an additional solution that supports multiple modes.
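
    To picture what "multiple modes" means, the sketch below keeps every value whose count ties for the maximum within its key. It uses a `transform('max')` filter, which is a different technique from the `modes` function in this answer and is shown only as an illustration; the data and the names `is_top` and `all_modes` are assumptions:

```python
import pandas as pd

# Hypothetical toy data: key 3 has two values tied for most frequent.
df = pd.DataFrame({'key':   [3, 3, 3, 3, 1, 1, 1],
                   'value': ['C', 'C', 'D', 'D', 'A', 'B', 'B']})

# Frequency of each (key, value) pair.
counts = df.groupby(['key', 'value']).size().to_frame('count').reset_index()

# Keep every row whose count equals the largest count within its key.
is_top = counts['count'] == counts.groupby('key')['count'].transform('max')
all_modes = counts[is_top]

print(all_modes)
```

    Unlike `modes`, this leaves one row per mode instead of collecting the modes of a group into a single array.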

    On a scale test that's representative of the data I'm working with, this reduced runtime from 37.4s to 0.5s!

    Here's the code for the solution, some example usage, and the scale test:

    import numpy as np
    import pandas as pd
    import random
    import time
    
    test_input = pd.DataFrame(columns=[ 'key',          'value'],
                              data=  [[ 1,              'A'    ],
                                      [ 1,              'B'    ],
                                      [ 1,              'B'    ],
                                      [ 1,              np.nan ],
                                      [ 2,              np.nan ],
                                      [ 3,              'C'    ],
                                      [ 3,              'C'    ],
                                      [ 3,              'D'    ],
                                      [ 3,              'D'    ]])
    
    def mode(df, key_cols, value_col, count_col):
        '''
        Pandas does not provide a `mode` aggregation function
        for its `GroupBy` objects. This function is meant to fill
        that gap, though the semantics are not exactly the same.
    
        The input is a DataFrame with the columns `key_cols`
        that you would like to group on, and the column
        `value_col` for which you would like to obtain the mode.
    
        The output is a DataFrame with a record per group that has at least one mode
        (null values are not counted). The `key_cols` are included as columns, `value_col`
        contains a mode (ties are broken arbitrarily and deterministically) for each
        group, and `count_col` indicates how many times each mode appeared in its group.
        '''
        return df.groupby(key_cols + [value_col]).size() \
                 .to_frame(count_col).reset_index() \
                 .sort_values(count_col, ascending=False) \
                 .drop_duplicates(subset=key_cols)
    
    def modes(df, key_cols, value_col, count_col):
        '''
        Pandas does not provide a `mode` aggregation function
        for its `GroupBy` objects. This function is meant to fill
        that gap, though the semantics are not exactly the same.
    
        The input is a DataFrame with the columns `key_cols`
        that you would like to group on, and the column
        `value_col` for which you would like to obtain the modes.
    
        The output is a DataFrame with a record per group that has at least
        one mode (null values are not counted). The `key_cols` are included as
        columns, `value_col` contains lists indicating the modes for each group,
        and `count_col` indicates how many times each mode appeared in its group.
        '''
        return df.groupby(key_cols + [value_col]).size() \
                 .to_frame(count_col).reset_index() \
                 .groupby(key_cols + [count_col])[value_col].unique() \
                 .to_frame().reset_index() \
                 .sort_values(count_col, ascending=False) \
                 .drop_duplicates(subset=key_cols)
    
    print(test_input)
    print(mode(test_input, ['key'], 'value', 'count'))
    print(modes(test_input, ['key'], 'value', 'count'))
    
    scale_test_data = [[random.randint(1, 100000),
                        str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
    scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                    data=scale_test_data)
    
    start = time.time()
    mode(scale_test_input, ['key'], 'value', 'count')
    print(time.time() - start)
    
    start = time.time()
    modes(scale_test_input, ['key'], 'value', 'count')
    print(time.time() - start)
    
    start = time.time()
    scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
    print(time.time() - start)
    

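    Running this code will print something like the following. The last three numbers are the runtimes in seconds for `mode`, `modes`, and the `groupby(...).agg(...)` baseline, in that order: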

       key value
    0    1     A
    1    1     B
    2    1     B
    3    1   NaN
    4    2   NaN
    5    3     C
    6    3     C
    7    3     D
    8    3     D
       key value  count
    1    1     B      2
    2    3     C      2
       key  count   value
    1    1      2     [B]
    2    3      2  [C, D]
    0.489614009857
    9.19386196136
    37.4375009537
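
    As a quick sanity check, the sketch below compares the count-sort-deduplicate approach with the slower `value_counts` lambda on a small made-up frame that has no ties and no nulls. The data and the names `fast`, `slow`, and `agree` are hypothetical; with ties or all-null groups the two approaches may differ:

```python
import pandas as pd

# Hypothetical toy data with a unique most frequent value per key.
df = pd.DataFrame({'key':   [1, 1, 1, 2, 2, 2],
                   'value': ['A', 'B', 'B', 'C', 'C', 'D']})

# Fast approach: count pairs, sort by count, keep the top row per key.
fast = (df.groupby(['key', 'value']).size()
          .to_frame('count').reset_index()
          .sort_values('count', ascending=False)
          .drop_duplicates(subset=['key'])
          .set_index('key')['value'].sort_index())

# Slow baseline: run value_counts separately inside every group.
slow = df.groupby('key')['value'].agg(lambda x: x.value_counts().index[0])

# Compare as plain dicts to sidestep dtype differences between versions.
agree = fast.to_dict() == slow.to_dict()
print(agree)
```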
    

    Hope this helps!
