GroupBy pandas DataFrame and select most common value

梦谈多话 2020-11-22 07:59

I have a data frame with three string columns. I know that only one value in the third column is valid for every combination of the first two. To clean the data I have to

10 Answers
  • 2020-11-22 08:36

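    If you don't need to handle NaN values, using Counter is much faster than pd.Series.mode or pd.Series.value_counts()[0]:

    from collections import Counter

    def get_most_common(srs):
        # Count every value in the group and return the most frequent one.
        x = list(srs)
        my_counter = Counter(x)
        return my_counter.most_common(1)[0][0]
    
    # `col` is the column (or list of columns) to group by.
    df.groupby(col).agg(get_most_common)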
    

    This should work, but it will give wrong results when you have NaN values, as each NaN can be counted as a separate value.
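
    One way around the NaN limitation is to drop missing values before counting. The following is only a minimal sketch: the helper name most_common_non_null and the small example frame are made up for illustration, and returning NaN for an all-missing group is an assumption about what you want.

```python
from collections import Counter

import numpy as np
import pandas as pd


def most_common_non_null(srs):
    """Return the most frequent non-null value, or NaN if the group has none."""
    counts = Counter(srs.dropna())
    if not counts:
        # Every value in the group was missing.
        return np.nan
    return counts.most_common(1)[0][0]


# Hypothetical data: group "c" contains only missing values.
df = pd.DataFrame({
    "key": ["a", "a", "a", "b", "b", "c"],
    "val": ["x", None, "x", "y", None, None],
})

result = df.groupby("key")["val"].agg(most_common_non_null)
print(result)
```

    The dropna call costs a little time per group, so this trades some of the Counter speed-up for robustness.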

  • 2020-11-22 08:36

    The problem here is performance: with a lot of rows, aggregating with a Python lambda gets slow.

    If that is your case, try the value_counts-based version at the end of this snippet:

    import pandas as pd
    
    source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short_name' : ['NY','New','Spb','NY']})
    
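    # Straightforward, but calls the lambda once per group and column.
    source.groupby(['Country','City']).agg(lambda x: x.value_counts().index[0])
    
    # Count once for all groups. Counts come back sorted within each group,
    # so the first row of each group holds its most common value.
    counts = source.groupby(['Country','City']).Short_name.value_counts()
    counts.groupby(level=['Country','City']).head(1)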
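
    The count-once idea can be sketched end to end as follows. This is an illustration rather than the answerer's exact code: the intermediate names counts, top and most_common are invented here, and the rename to "count" is only there so the snippet behaves the same across pandas versions that name the value_counts result differently.

```python
import pandas as pd

source = pd.DataFrame({
    "Country": ["USA", "USA", "Russia", "USA"],
    "City": ["New-York", "New-York", "Sankt-Petersburg", "New-York"],
    "Short_name": ["NY", "New", "Spb", "NY"],
})

# One pass over the data: occurrences of each Short_name per (Country, City).
counts = source.groupby(["Country", "City"])["Short_name"].value_counts()

# Within each group the counts are sorted in descending order,
# so the first row per group is the most common value.
top = counts.groupby(level=["Country", "City"]).head(1)

# Move the winning label out of the index into a regular column.
most_common = top.rename("count").reset_index(level="Short_name")["Short_name"]
print(most_common)
```

    If two values tie within a group, which one comes first is not something this sketch controls.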
    
  • 2020-11-22 08:37

    If you want another approach that does not depend on value_counts or scipy.stats, you can use the Counter collection:

    from collections import Counter
    get_most_common = lambda values: max(Counter(values).items(), key = lambda x: x[1])[0]
    

    It can be applied to the example above like this:

    src = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'], 
                  'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                  'Short_name' : ['NY','New','Spb','NY']})
    
    src.groupby(['Country','City']).agg(get_most_common)
    
  • 2020-11-22 08:39

    For agg, the lambda function receives a Series, which does not have a 'Short name' attribute.

    stats.mode returns a tuple of two arrays, so you have to take the first element of the first array in this tuple.

    With these two simple changes:

    from scipy import stats

    # Written for older SciPy; releases from 1.11 on no longer accept string data in stats.mode.
    source.groupby(['Country','City']).agg(lambda x: stats.mode(x)[0][0])
    

    returns

                             Short name
    Country City                       
    Russia  Sankt-Petersburg        Spb
    USA     New-York                 NY
    
  • 2020-11-22 08:40

    Pandas >= 0.16

    pd.Series.mode is available!

    Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

    source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
    
    Country  City            
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object
    

    If this is needed as a DataFrame, use

    source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()
    
                             Short name
    Country City                       
    Russia  Sankt-Petersburg        Spb
    USA     New-York                 NY
    

    The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.

    # Accepted answer.
    %timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
    # Proposed in this post.
    %timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
    
    5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    Dealing with Multiple Modes

    Series.mode also does a good job when there are multiple modes:

    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
    source2 = pd.concat(
        [source,
         pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
        ignore_index=True)
    
    # Now `source2` has two modes for the 
    # ("USA", "New-York") group, they are "NY" and "New".
    source2
    
      Country              City Short name
    0     USA          New-York         NY
    1     USA          New-York        New
    2  Russia  Sankt-Petersburg        Spb
    3     USA          New-York         NY
    4     USA          New-York        New
    

    source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)
    
    Country  City            
    Russia   Sankt-Petersburg          Spb
    USA      New-York            [NY, New]
    Name: Short name, dtype: object
    

    Or, if you want a separate row for each mode, you can use GroupBy.apply:

    source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)
    
    Country  City               
    Russia   Sankt-Petersburg  0    Spb
    USA      New-York          0     NY
                               1    New
    Name: Short name, dtype: object
    

    If you don't care which mode is returned as long as it's either one of them, then you will need a lambda that calls mode and extracts the first result.

    source2.groupby(['Country','City'])['Short name'].agg(
        lambda x: pd.Series.mode(x)[0])
    
    Country  City            
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object
    

    Alternatives to (not) consider

    You can also use statistics.mode from python, but...

    source.groupby(['Country','City'])['Short name'].apply(statistics.mode)
    
    Country  City            
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object
    

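    ...on Python versions before 3.8 it does not cope with multiple modes; a StatisticsError is raised. (From Python 3.8 on, statistics.mode returns the first mode encountered instead.) This is mentioned in the docs for those older versions:

    If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

    But you can see for yourself (on Python < 3.8)...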

    statistics.mode([1, 2])
    # ---------------------------------------------------------------------------
    # StatisticsError                           Traceback (most recent call last)
    # ...
    # StatisticsError: no unique mode; found 2 equally common values
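
    On Python 3.8 or newer, the standard library also offers statistics.multimode, which returns every mode as a list instead of raising. A minimal sketch, assuming Python 3.8+ and rebuilding source2 by hand so the snippet is self-contained; taking element [0] of the list is just one possible tie-breaking choice:

```python
import statistics

import pandas as pd

source2 = pd.DataFrame({
    "Country": ["USA", "USA", "Russia", "USA", "USA"],
    "City": ["New-York", "New-York", "Sankt-Petersburg", "New-York", "New-York"],
    "Short name": ["NY", "New", "Spb", "NY", "New"],
})

# multimode returns all modes, in the order they were first encountered.
all_modes = statistics.multimode(["NY", "New", "NY", "New"])

# Per group, keep the first of the (possibly several) modes.
first_mode = (
    source2.groupby(["Country", "City"])["Short name"]
    .agg(lambda s: statistics.multimode(s)[0])
)
print(all_modes)
print(first_mode)
```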
    
  • 2020-11-22 08:41

    The two top answers here suggest:

    df.groupby(cols).agg(lambda x:x.value_counts().index[0])
    

    or, preferably

    df.groupby(cols).agg(pd.Series.mode)
    

    However both of these fail in simple edge cases, as demonstrated here:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'client_id':['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C'],
        'date':['2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01', '2019-01-01'],
        'location':['NY', 'NY', 'LA', 'LA', 'DC', 'DC', 'LA', np.nan]  # np.NaN was removed in NumPy 2.0
    })
    

    The first:

    df.groupby(['client_id', 'date']).agg(lambda x:x.value_counts().index[0])
    

    yields IndexError (because of the empty Series returned by group C). The second:

    df.groupby(['client_id', 'date']).agg(pd.Series.mode)
    

    raises ValueError: Function does not reduce, because the first group returns a list of two values (there are two modes). (As documented here, if the first group returned a single mode this would work!)

    Two possible solutions for this case are:

    from scipy import stats
    df.groupby(['client_id', 'date']).agg(lambda x: stats.mode(x)[0])
    

    And the solution given to me by cs95 in the comments here:

    def foo(x):
        # Take the first mode, or NaN when the group has no non-null values.
        m = pd.Series.mode(x)
        return m.values[0] if not m.empty else np.nan

    df.groupby(['client_id', 'date']).agg(foo)
    

    However, all of these are slow and not suited for large datasets. A solution I ended up using which a) can deal with these cases and b) is much, much faster, is a lightly modified version of abw33's answer (which should be higher):

    def get_mode_per_column(dataframe, group_cols, col):
        return (dataframe.fillna(-1)  # NaN placeholder to keep group 
                .groupby(group_cols + [col])
                .size()
                .to_frame('count')
                .reset_index()
                .sort_values('count', ascending=False)
                .drop_duplicates(subset=group_cols)
                .drop(columns=['count'])
                .sort_values(group_cols)
                .replace(-1, np.nan))  # restore NaNs
    
    group_cols = ['client_id', 'date']    
    non_grp_cols = list(set(df).difference(group_cols))
    output_df = get_mode_per_column(df, group_cols, non_grp_cols[0]).set_index(group_cols)
    for col in non_grp_cols[1:]:
        output_df[col] = get_mode_per_column(df, group_cols, col)[col].values
    

    Essentially, the method works on one column at a time and outputs a DataFrame. Instead of combining the per-column results with concat, which is expensive, you treat the first result as the base DataFrame and then add each further result's array (.values) to it as a new column.
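
    To see the count, sort and de-duplicate steps on the edge-case frame from this answer, here is a condensed single-column sketch. It is not the answer's exact function: the name mode_per_group and the string placeholder "__missing__" are hypothetical, and the placeholder is assumed never to occur as a real value in the column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "client_id": ["A", "A", "A", "A", "B", "B", "B", "C"],
    "date": ["2019-01-01"] * 8,
    "location": ["NY", "NY", "LA", "LA", "DC", "DC", "LA", np.nan],
})


def mode_per_group(frame, group_cols, col, placeholder="__missing__"):
    """Most frequent value of `col` per group; all-NaN groups yield NaN."""
    counts = (
        frame[group_cols + [col]]
        .fillna({col: placeholder})       # keep groups whose values are all NaN
        .groupby(group_cols + [col])
        .size()
        .reset_index(name="count")
    )
    top = (
        counts.sort_values("count", ascending=False)
        .drop_duplicates(subset=group_cols)  # keep the highest count per group
        .sort_values(group_cols)
        .set_index(group_cols)[col]
    )
    # Turn the placeholder back into NaN.
    return top.where(top != placeholder)


result = mode_per_group(df, ["client_id", "date"], "location")
print(result)
```

    Group A has two modes (NY and LA); this sketch returns one of them without promising which.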
