While answering this question, it turned out that df.groupby(...).agg(set) and df.groupby(...).agg(lambda x: set(x)) produce different results.
Perhaps, as @Edchum commented, agg applies Python builtin functions to each group as a mini DataFrame, whereas a user-defined function is applied to every column of each group. Passing print illustrates this:
df.groupby('user_id').agg(print,end='\n\n')
class_type instructor user_id
0 Krav Maga Bob 1
4 Ju-jitsu Alice 1
class_type instructor user_id
1 Yoga Alice 2
5 Krav Maga Alice 2
class_type instructor user_id
2 Ju-jitsu Bob 3
6 Karate Bob 3
df.groupby('user_id').agg(lambda x : print(x,end='\n\n'))
0 Krav Maga
4 Ju-jitsu
Name: class_type, dtype: object
1 Yoga
5 Krav Maga
Name: class_type, dtype: object
2 Ju-jitsu
6 Karate
Name: class_type, dtype: object
3 Krav Maga
Name: class_type, dtype: object
...
This is probably why applying set gave the result mentioned above.
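For comparison, here is a minimal sketch of the lambda form on a small frame. The frame is reconstructed from the printed groups above and is an assumption about the original data; row 3 is left out because its instructor value is not shown.

```python
import pandas as pd

# Reconstructed from the printed output above (assumption); row 3 is omitted
# because its instructor value never appears in the output.
df = pd.DataFrame(
    {
        "class_type": ["Krav Maga", "Yoga", "Ju-jitsu", "Ju-jitsu", "Krav Maga", "Karate"],
        "instructor": ["Bob", "Alice", "Bob", "Alice", "Alice", "Bob"],
        "user_id": [1, 2, 3, 1, 2, 3],
    },
    index=[0, 1, 2, 4, 5, 6],
)

# The lambda is called once per column per group, so x is a Series
# and set(x) collects that column's values.
per_column = df.groupby("user_id").agg(lambda x: set(x))
print(per_column)
```

Each cell of per_column should hold the set of values for that user and column, which is the result most people expect from agg(set).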
OK, what is happening here is that set isn't being handled, as it's not is_list_like in _aggregate:
elif is_list_like(arg) and arg not in compat.string_types:
see source
set isn't is_list_like, so None is returned up the call chain, ending up at this line:
results.append(colg.aggregate(a))
see source
This raises TypeError: 'type' object is not iterable, which in turn raises:
if not len(results):
    raise ValueError("no results")
see source
Because we have no results, we end up calling _aggregate_generic:
see source
this then calls:
result[name] = self._try_cast(func(data, *args, **kwargs), data)
see source
This then ends up as:
(Pdb) n
> c:\programdata\anaconda3\lib\site-packages\pandas\core\groupby.py(3779)_aggregate_generic()
-> return self._wrap_generic_output(result, obj)
(Pdb) result
{1: {'user_id', 'instructor', 'class_type'}, 2: {'user_id', 'instructor', 'class_type'}, 3: {'user_id', 'instructor', 'class_type'}, 4: {'user_id', 'instructor', 'class_type'}}
I'm running a slightly different version of pandas but the equivalent source line is https://github.com/pandas-dev/pandas/blob/v0.22.0/pandas/core/groupby.py#L3779
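The two key steps of that chain can be reproduced outside pandas internals. This is only a rough sketch of the behaviour described above, not the pandas code itself; the frame is a cut-down reconstruction of the example data (an assumption), and the dict comprehension merely stands in for the per-group call in _aggregate_generic.

```python
import pandas as pd

# Cut-down reconstruction of the example data (assumption).
df = pd.DataFrame(
    {
        "class_type": ["Krav Maga", "Yoga", "Ju-jitsu", "Ju-jitsu", "Krav Maga", "Karate"],
        "instructor": ["Bob", "Alice", "Bob", "Alice", "Alice", "Bob"],
        "user_id": [1, 2, 3, 1, 2, 3],
    },
    index=[0, 1, 2, 4, 5, 6],
)

# Step 1: treating `set` as a list of functions means iterating over the
# type itself, which fails.
try:
    for func in set:
        pass
except TypeError as exc:
    error_message = str(exc)

# Step 2: stand-in for the generic fallback, where the callable is applied
# to each whole group (a DataFrame) rather than to each column.
result = {name: set(group) for name, group in df.groupby("user_id")}

print(error_message)
print(result)
```

The printed dict has the same shape as the result shown in the debugger session: one set of column names per group.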
So essentially, because set doesn't count as a function or an iterable here, it just collapses to calling the constructor on each group as an iterable, and iterating over a DataFrame yields its column names. You can see the same effect here:
In [8]:
df.groupby('user_id').agg(lambda x: print(set(x.columns)))
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
{'class_type', 'instructor', 'user_id'}
Out[8]:
class_type instructor
user_id
1 None None
2 None None
3 None None
4 None None
But when you use the lambda, which is an anonymous function, this works as expected.
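The difference comes down to what the set constructor iterates over. A small sketch, using a two-row frame that mirrors the group for user_id 1 in the output above:

```python
import pandas as pd

# Mirrors the user_id == 1 group printed earlier.
group = pd.DataFrame(
    {
        "class_type": ["Krav Maga", "Ju-jitsu"],
        "instructor": ["Bob", "Alice"],
        "user_id": [1, 1],
    },
    index=[0, 4],
)

# Iterating over a DataFrame yields its column labels.
columns_seen = set(group)

# Iterating over a Series yields its values, which is what the lambda sees.
values_seen = set(group["class_type"])

print(columns_seen)
print(values_seen)
```

So set applied to a whole group gives the column names, while set applied to a single column gives that column's values.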