GroupBy pandas DataFrame and select most common value

梦谈多话 2020-11-22 07:59

I have a data frame with three string columns. I know that only one value in the third column is valid for each combination of the first two. To clean the data I have to

10 answers
  • 2020-11-22 08:45

    A slightly clumsier but faster approach for larger datasets is to count the occurrences of each value in the column of interest, sort the counts from highest to lowest, and then de-duplicate on the key columns so that only the most frequent value per group is retained. For example:

    >>> import pandas as pd
    >>> source = pd.DataFrame(
            {
                'Country': ['USA', 'USA', 'Russia', 'USA'], 
                'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                'Short name': ['NY', 'New', 'Spb', 'NY']
            }
        )
    >>> grouped_df = source\
            .groupby(['Country','City','Short name'])[['Short name']]\
            .count()\
            .rename(columns={'Short name':'count'})\
            .reset_index()\
            .sort_values('count', ascending=False)\
            .drop_duplicates(subset=['Country', 'City'])\
            .drop('count', axis=1)
    >>> print(grouped_df)
      Country              City Short name
    1     USA          New-York         NY
    0  Russia  Sankt-Petersburg        Spb
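
    The same count, sort, and de-duplicate steps can also be sketched with `size()` in place of the `count()`/`rename()` pair. This is only a minimal variant, assuming that any one of several tied values is acceptable; the names `key_cols`, `value_col`, and `most_common` are introduced here for illustration and are not part of the original answer.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

key_cols = ['Country', 'City']   # hypothetical name: columns that define a group
value_col = 'Short name'         # hypothetical name: column whose top value we want

most_common = (
    source.groupby(key_cols + [value_col])
    .size()                                   # number of rows per (key, value) pair
    .reset_index(name='count')
    .sort_values('count', ascending=False, kind='stable')
    .drop_duplicates(subset=key_cols)         # keep the highest-count row per key
    .drop(columns='count')
    .reset_index(drop=True)
)
print(most_common)
```

    Using a stable sort keeps tied rows in their grouped order, so repeated runs on the same data should pick the same value.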
    
  • 2020-11-22 08:46

    A little late to the game here, but I was running into some performance issues with HYRY's solution, so I had to come up with another one.

    It works by finding the frequency of each key-value, and then, for each key, only keeping the value that appears with it most often.

    There's also an additional solution that supports multiple modes.

    On a scale test that's representative of the data I'm working with, this reduced runtime from 37.4s to 0.5s!

    Here's the code for the solution, some example usage, and the scale test:

    import numpy as np
    import pandas as pd
    import random
    import time
    
    test_input = pd.DataFrame(columns=[ 'key',          'value'],
                              data=  [[ 1,              'A'    ],
                                      [ 1,              'B'    ],
                                      [ 1,              'B'    ],
                                      [ 1,              np.nan ],
                                      [ 2,              np.nan ],
                                      [ 3,              'C'    ],
                                      [ 3,              'C'    ],
                                      [ 3,              'D'    ],
                                      [ 3,              'D'    ]])
    
    def mode(df, key_cols, value_col, count_col):
        '''
        Pandas does not provide a `mode` aggregation function
        for its `GroupBy` objects. This function is meant to fill
        that gap, though the semantics are not exactly the same.

        The input is a DataFrame with the columns `key_cols`
        that you would like to group on, and the column
        `value_col` for which you would like to obtain the mode.

        The output is a DataFrame with a record per group that has at least one mode
        (null values are not counted). The `key_cols` are included as columns, `value_col`
        contains a mode (ties are broken arbitrarily and deterministically) for each
        group, and `count_col` indicates how many times each mode appeared in its group.
        '''
        return df.groupby(key_cols + [value_col]).size() \
                 .to_frame(count_col).reset_index() \
                 .sort_values(count_col, ascending=False) \
                 .drop_duplicates(subset=key_cols)
    
    def modes(df, key_cols, value_col, count_col):
        '''
        Pandas does not provide a `mode` aggregation function
        for its `GroupBy` objects. This function is meant to fill
        that gap, though the semantics are not exactly the same.

        The input is a DataFrame with the columns `key_cols`
        that you would like to group on, and the column
        `value_col` for which you would like to obtain the modes.

        The output is a DataFrame with a record per group that has at least
        one mode (null values are not counted). The `key_cols` are included as
        columns, `value_col` contains lists indicating the modes for each group,
        and `count_col` indicates how many times each mode appeared in its group.
        '''
        return df.groupby(key_cols + [value_col]).size() \
                 .to_frame(count_col).reset_index() \
                 .groupby(key_cols + [count_col])[value_col].unique() \
                 .to_frame().reset_index() \
                 .sort_values(count_col, ascending=False) \
                 .drop_duplicates(subset=key_cols)
    
    print(test_input)
    print(mode(test_input, ['key'], 'value', 'count'))
    print(modes(test_input, ['key'], 'value', 'count'))
    
    scale_test_data = [[random.randint(1, 100000),
                        str(random.randint(123456789001, 123456789100))] for i in range(1000000)]
    scale_test_input = pd.DataFrame(columns=['key', 'value'],
                                    data=scale_test_data)
    
    start = time.time()
    mode(scale_test_input, ['key'], 'value', 'count')
    print(time.time() - start)
    
    start = time.time()
    modes(scale_test_input, ['key'], 'value', 'count')
    print(time.time() - start)
    
    start = time.time()
    scale_test_input.groupby(['key']).agg(lambda x: x.value_counts().index[0])
    print(time.time() - start)
    

    Running this code will print something like:

       key value
    0    1     A
    1    1     B
    2    1     B
    3    1   NaN
    4    2   NaN
    5    3     C
    6    3     C
    7    3     D
    8    3     D
       key value  count
    1    1     B      2
    2    3     C      2
       key  count   value
    1    1      2     [B]
    2    3      2  [C, D]
    0.489614009857
    9.19386196136
    37.4375009537
    

    Hope this helps!
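
    For comparison, the multiple-modes result can also be sketched with the built-in `Series.mode()` applied per group. This is not the approach timed above, and since it calls a Python function once per group it may well be slower on large inputs; the name `all_modes` is made up for this illustration.

```python
import numpy as np
import pandas as pd

test_input = pd.DataFrame({
    'key':   [1, 1, 1, 1, 2, 3, 3, 3, 3],
    'value': ['A', 'B', 'B', np.nan, np.nan, 'C', 'C', 'D', 'D'],
})

# Series.mode() ignores nulls by default and returns every tied value,
# in sorted order; tolist() turns that into one list per group.
all_modes = test_input.groupby('key')['value'].agg(lambda s: s.mode().tolist())
print(all_modes)
```

    Unlike `modes` above, this keeps groups that contain only nulls and gives them an empty list rather than dropping them.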

  • 2020-11-22 08:47

    Formally, the correct answer is @eumiro's solution. The problem with @HYRY's solution is that it gives the wrong result when a group contains only unique values, such as [1, 2, 3, 4]: every value is tied, so the one it returns is not a well-defined mode. Example:

    >>> import pandas as pd
    >>> df = pd.DataFrame(
            {
                'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'], 
                'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4], 
                'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40]
            }
        )
    

    If you compute like @HYRY you obtain:

    >>> print(df.groupby(['client']).agg(lambda x: x.value_counts().index[0]))
            total  bla
    client            
    A           4   30
    B           4   40
    C           1   10
    D           3   30
    E           2   20
    

    This is clearly wrong for client A, whose value should be 1 and not 4. All of A's values are unique, so they all tie for most frequent, and `value_counts().index[0]` returns one of them in no guaranteed order.

    Thus, the other solution is correct:

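    >>> import scipy.stats
    >>> # In SciPy 1.11 and later, mode() returns scalars by default; use scipy.stats.mode(x)[0] there.
    >>> print(df.groupby(['client']).agg(lambda x: scipy.stats.mode(x)[0][0]))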
            total  bla
    client            
    A           1   10
    B           4   40
    C           1   10
    D           3   30
    E           2   20
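
    If you would rather not depend on SciPy, the same smallest-value tie-break can be sketched with pandas alone. This assumes that picking the smallest of the tied values is the behaviour you want, as in the output above; `result` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'client': ['A', 'B', 'A', 'B', 'B', 'C', 'A', 'D', 'D', 'E', 'E', 'E', 'E', 'E', 'A'],
    'total': [1, 4, 3, 2, 4, 1, 2, 3, 5, 1, 2, 2, 2, 3, 4],
    'bla': [10, 40, 30, 20, 40, 10, 20, 30, 50, 10, 20, 20, 20, 30, 40],
})

# Series.mode() returns all tied values in sorted order, so .iloc[0]
# always selects the smallest of them.
result = df.groupby('client').agg(lambda x: x.mode().iloc[0])
print(result)
```

    Note that `.iloc[0]` would raise an error for a column that is entirely null within a group, because `mode()` returns an empty Series in that case.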
    
  • 2020-11-22 08:54

    You can use value_counts() to get a count series, and get the first row:

    import pandas as pd
    
    source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                      'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                      'Short name' : ['NY','New','Spb','NY']})
    
    source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
    

    If you want to combine this with other aggregation functions in the same .agg() call, use named aggregation:

    # Let's add a new col,  account
    source['account'] = [1,2,3,3]
    
    source.groupby(['Country', 'City']).agg(
        mod=('Short name', lambda x: x.value_counts().index[0]),
        avg=('account', 'mean'),
    )
    