df.groupby(…).agg(set) produces a different result than df.groupby(…).agg(lambda x: set(x))

谎友^ 2021-02-07 07:43

While answering this question, it turned out that df.groupby(...).agg(set) and df.groupby(...).agg(lambda x: set(x)) produce different results.

2 Answers
  •  忘了有多久
    2021-02-07 08:48

    OK, what is happening here is that set isn't handled, because it doesn't count as is_list_like in _aggregate:

    elif is_list_like(arg) and arg not in compat.string_types:
    

    see source

    Since set isn't is_list_like, None is returned up the call chain, which ends up at this line:

    results.append(colg.aggregate(a))
    

    see source

    This raises a TypeError ('type' object is not iterable), which in turn leads to:

    if not len(results):
        raise ValueError("no results")
    

    see source

    Because there are no results, we end up calling _aggregate_generic:

    see source

    This then calls:

    result[name] = self._try_cast(func(data, *args, **kwargs), data)
    

    see source

    This then ends up as:

    (Pdb) n
    > c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
    -> return self._wrap_generic_output(result, obj)
    
    (Pdb) result
    {1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}
    

    I'm running a slightly different version of pandas but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779
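The TypeError quoted earlier is ordinary Python behaviour and can be reproduced without pandas. A minimal sketch: trying to loop over the class object set (rather than over a set instance) fails with the same message, while calling set(...) on an iterable works as usual.

```python
# Looping over the class object `set` fails; looping over an instance works.
try:
    for _ in set:  # the type itself, not an instance of it
        pass
except TypeError as exc:
    message = str(exc)

print(message)

# Calling the constructor on an iterable is fine and removes duplicates.
values = sorted(set(["b", "a", "b"]))
print(values)
```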

    So essentially, because set counts as neither a function nor an iterable here, the call collapses to invoking the set constructor on each group, and iterating a group yields its column labels. You can see the same effect here:

    In [8]:
    
    df.groupby('user_id').agg(lambda x: print(set(x.columns)))
    {'class_type', 'instructor', 'user_id'}
    {'class_type', 'instructor', 'user_id'}
    {'class_type', 'instructor', 'user_id'}
    {'class_type', 'instructor', 'user_id'}
    Out[8]: 
            class_type instructor
    user_id                      
    1             None       None
    2             None       None
    3             None       None
    4             None       None
    

    But when you use the lambda, which is an anonymous function, this works as expected.
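The difference between the two outcomes can be sketched as follows. The data is made up for illustration; only the column names are taken from the debugger output above. The sketch relies on two facts: iterating a DataFrame yields its column labels, while iterating a Series yields its values.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the thread.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "class_type": ["Krav Maga", "Yoga", "Ju-jitsu", "Krav Maga"],
    "instructor": ["Bob", "Alice", "Alice", "Alice"],
})

# Calling set() on a whole frame collects the column labels, not the cell values.
labels = set(df)
print(labels)

# With the lambda, x is one column of one group (a Series), so set(x) holds values.
per_column = df.groupby("user_id").agg(lambda x: set(x))
print(per_column)
```

The output of agg(set) itself is deliberately not shown: it depends on the internal dispatch traced above, which may differ between pandas versions.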
