df.groupby(…).agg(set) produces different result compared to df.groupby(…).agg(lambda x: set(x))

谎友^ 2021-02-07 07:43

While answering this question, it turned out that df.groupby(...).agg(set) and df.groupby(...).agg(lambda x: set(x)) produce different results.
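
A minimal sketch to reproduce the comparison. The DataFrame is reconstructed from the groups printed in the answers below; the instructor in row 3 never appears there, so the value used for it is a hypothetical fill-in, and the exact outputs depend on the pandas version:

    import pandas as pd

    # Reconstructed from the groups printed in the answers; row 3's
    # instructor is a hypothetical placeholder.
    df = pd.DataFrame({
        'class_type': ['Krav Maga', 'Yoga', 'Ju-jitsu', 'Krav Maga',
                       'Ju-jitsu', 'Krav Maga', 'Karate'],
        'instructor': ['Bob', 'Alice', 'Bob', 'Alice',
                       'Alice', 'Alice', 'Bob'],
        'user_id': [1, 2, 3, 4, 1, 2, 3],
    })

    # On pandas 0.22 (the version traced in the second answer) these two
    # calls gave different results; compare them on your own version.
    print(df.groupby('user_id').agg(set))
    print(df.groupby('user_id').agg(lambda x: set(x)))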

2 Answers
  • 2021-02-07 08:47

    Perhaps, as @Edchum commented, agg applies Python builtin functions by treating each groupby group as a mini DataFrame, whereas a user-defined function is applied to every column separately. An example to illustrate this is via print.

    df.groupby('user_id').agg(print, end='\n\n')
    
      class_type instructor  user_id
    0  Krav Maga        Bob        1
    4   Ju-jitsu      Alice        1
    
      class_type instructor  user_id
    1       Yoga      Alice        2
    5  Krav Maga      Alice        2
    
      class_type instructor  user_id
    2   Ju-jitsu        Bob        3
    6     Karate        Bob        3
    
    
    df.groupby('user_id').agg(lambda x: print(x, end='\n\n'))
    
    0    Krav Maga
    4     Ju-jitsu
    Name: class_type, dtype: object
    
    1         Yoga
    5    Krav Maga
    Name: class_type, dtype: object
    
    2    Ju-jitsu
    6      Karate
    Name: class_type, dtype: object
    
    3    Krav Maga
    Name: class_type, dtype: object
    
    ...
    

    Hopefully this is the reason why applying set gave the result mentioned above.
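
    If per-column sets are what you want, forcing the user-defined-function path (the second form from the question) gives them directly; a sketch, assuming the df reconstructed in the snippet under the question:

    # A lambda is applied to every column of every group, so each cell
    # becomes the set of that column's values within the group.
    per_column_sets = df.groupby('user_id').agg(lambda s: set(s))
    print(per_column_sets)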

  • 2021-02-07 08:48

    OK, what is happening here is that set isn't handled, because it's not is_list_like in _aggregate:

    elif is_list_like(arg) and arg not in compat.string_types:
    

    see source
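
    For reference, the predicate is exposed publicly as pandas.api.types.is_list_like. In recent pandas releases it explicitly excludes type objects, which matches the behaviour described here (the 0.22 internals quoted above may differ in detail):

    from pandas.api.types import is_list_like

    # The bare builtin `set` is a type object, not a list of functions,
    # so the list-like branch above is skipped.
    print(is_list_like(set))     # False in recent pandas releases
    print(is_list_like([set]))   # True - an actual list of callables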

    Because set isn't list-like, this returns None up the call chain, and we end up at this line:

    results.append(colg.aggregate(a))
    

    see source

    this raises TypeError: 'type' object is not iterable
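
    That exact message can be reproduced outside pandas, since it comes from trying to iterate over the type object set itself (a minimal illustration):

    # Iterating over the builtin type `set` (not an instance of it) fails
    # with the same message that pandas swallows internally here.
    try:
        list(set)   # like `for a in arg:` when arg is the type `set`
    except TypeError as exc:
        print(exc)  # 'type' object is not iterable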

    That TypeError is swallowed, so results stays empty, which then raises:

    if not len(results):
        raise ValueError("no results")
    

    see source

    so because we have no results we end up calling _aggregate_generic:

    see source

    this then calls:

    result[name] = self._try_cast(func(data, *args, **kwargs), data)
    

    see source

    This then ends up as:

    (Pdb) n
    > c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
    -> return self._wrap_generic_output(result, obj)
    
    (Pdb) result
    {1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}
    

    I'm running a slightly different version of pandas but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779
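
    The values in result come from calling set on each group DataFrame; iterating a DataFrame yields its column labels, so set(frame) is simply the set of column names (a quick check with a one-row frame):

    import pandas as pd

    frame = pd.DataFrame({'class_type': ['Yoga'],
                          'instructor': ['Alice'],
                          'user_id': [2]})

    # Iterating a DataFrame yields its column labels, so calling the set
    # constructor on it gives the column-name sets seen in `result` above.
    print(set(frame))   # {'class_type', 'instructor', 'user_id'}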

    So essentially, because set doesn't count as a function or an iterable here, it just collapses to calling the constructor on each group DataFrame; iterating a DataFrame yields its column labels, which is why each group becomes the set of column names. You can see the same effect here:

    In [8]:
    
    df.groupby('user_id').agg(lambda x: print(set(x.columns)))
    {'class_type', 'instructor', 'user_id'}
    {'class_type', 'instructor', 'user_id'}
    {'class_type', 'instructor', 'user_id'}
    {'class_type', 'instructor', 'user_id'}
    Out[8]: 
            class_type instructor
    user_id                      
    1             None       None
    2             None       None
    3             None       None
    4             None       None
    

    But when you use the lambda, which is an ordinary (anonymous) function, it is applied per column and works as expected.
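
    To confirm that the lambda path hands each column of each group to the callable as a Series (rather than the whole group frame that the bare builtin received), a small check using the df reconstructed at the top of the page:

    # Each cell records the type the callable received: one Series per
    # column per group.
    print(df.groupby('user_id').agg(lambda x: type(x).__name__))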
