Pandas, groupby and count

借酒劲吻你 2020-11-29 10:08

I have a dataframe, say like this:

>>> df = pd.DataFrame({'user_id': ['a', 'a', 's', 's', 's'],
...                    'session': [4, 5, 4, 5, 5],
...
3 answers
  • 2020-11-29 10:40

    pandas >= 1.1: df.value_counts is available!

    From pandas 1.1, this will be my recommended method for counting the number of rows in groups (i.e., the group size). To count the number of non-NaN rows in a group for a specific column, check out the accepted answer.

    Old

    df.groupby(['A', 'B']).size()   # df.groupby(['A', 'B'])['C'].count()
    

    New [✓]

    df.value_counts(subset=['A', 'B']) 
    

    Note that size and count are not identical: the former counts all rows per group, while the latter counts only non-null rows. See this other answer of mine for more.
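    The difference is easiest to see on a small frame with a missing value (a made-up example, not from the post):

    ```python
    import pandas as pd
    import numpy as np

    # Toy frame: the ('x', 1) group has two rows, but one NaN in 'C'
    df = pd.DataFrame({'A': ['x', 'x', 'y'],
                       'B': [1, 1, 2],
                       'C': [10.0, np.nan, 30.0]})

    # size() counts every row in each group, NaN or not
    sizes = df.groupby(['A', 'B']).size()

    # count() counts only the non-NaN values of the selected column
    counts = df.groupby(['A', 'B'])['C'].count()

    print(sizes[('x', 1)])   # 2 rows in the ('x', 1) group
    print(counts[('x', 1)])  # but only 1 non-NaN 'C' value
    ```

    So the two only agree when the counted column has no missing values.
    
    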


    Minimal Example

    pd.__version__
    # '1.1.0.dev0+2004.g8d10bfb6f'
    
    df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                       'num_wings': [2, 0, 0, 0]},
                      index=['falcon', 'dog', 'cat', 'ant'])
    df
            num_legs  num_wings
    falcon         2          2
    dog            4          0
    cat            4          0
    ant            6          0
    
    df.value_counts(subset=['num_legs', 'num_wings'], sort=False)
    
    num_legs  num_wings
    2         2            1
    4         0            2
    6         0            1
    dtype: int64
    

    Compare this output with

    df.groupby(['num_legs', 'num_wings'])['num_legs'].size()
    
    num_legs  num_wings
    2         2            1
    4         0            2
    6         0            1
    Name: num_legs, dtype: int64
    

    Performance

    It's also faster if you don't sort the result:

    %timeit df.groupby(['num_legs', 'num_wings'])['num_legs'].count()
    %timeit df.value_counts(subset=['num_legs', 'num_wings'], sort=False)
    
    640 µs ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    568 µs ± 6.88 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
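    One thing to keep in mind about that sort=False flag (sketched on the same example frame): by default value_counts orders groups by frequency, descending, while sort=False leaves them in key order, as in the output above:

    ```python
    import pandas as pd

    df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
                       'num_wings': [2, 0, 0, 0]})

    # Default: sorted by count, so the most frequent combination comes first
    sorted_counts = df.value_counts(subset=['num_legs', 'num_wings'])

    # sort=False: groups appear in key order instead
    unsorted = df.value_counts(subset=['num_legs', 'num_wings'], sort=False)

    print(sorted_counts.iloc[0])  # 2 -- the (4, 0) group, which occurs twice
    print(unsorted.index[0])      # (2, 2) -- first key, not most frequent
    ```
    
    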
  • 2020-11-29 10:49

    I struggled with the same issue and made use of the solution provided above. You can actually designate any of the columns to count:

    df.groupby(['revenue','session','user_id'])['revenue'].count()
    

    and

    df.groupby(['revenue','session','user_id'])['session'].count()
    

    would give the same answer.
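    A sketch of why (the revenue values here are made up, since the question's snippet is truncated): as long as none of the grouped columns contain NaN, counting any one of them gives identical results:

    ```python
    import pandas as pd

    # Hypothetical data in the shape of the question's frame;
    # the 'revenue' values are invented for illustration
    df = pd.DataFrame({'user_id': ['a', 'a', 's', 's', 's'],
                       'session': [4, 5, 4, 5, 5],
                       'revenue': [1, 1, 2, 2, 2]})

    by = ['revenue', 'session', 'user_id']

    # No NaNs anywhere, so count() over any column matches
    print(df.groupby(by)['revenue'].count().equals(
          df.groupby(by)['session'].count()))  # True
    ```

    With missing values in one of the columns, the counts would diverge, which is exactly the size-versus-count distinction from the first answer.
    
    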

  • 2020-11-29 10:55

    You seem to want to group by several columns at once:

    df.groupby(['revenue','session','user_id'])['user_id'].count()
    

    should give you what you want.
