I'm curious about the behavior of pandas groupby-apply when the apply function returns a series.
When the series are of different lengths, it returns a multi-indexed se
In essence, a data frame consists of equal-length Series (technically, a dict-like container of Series objects). As stated in the pandas split-apply-combine docs, a groupby() operation involves one or more of the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
Notice that this does not say a data frame is always produced, only a data structure in general. So a groupby() operation on a data frame can return a Series and, given a Series as input, can return a data frame.
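As a rough illustration of the point that the combined result is not always a data frame, here is a minimal sketch. The frame `df` and its `value` column are made up for this example and are not taken from the question.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({"state": ["A", "A", "B", "B", "B"],
                   "value": [1, 2, 3, 4, 5]})

# Data frame in, Series out: one number per group.
totals = df.groupby("state")["value"].sum()

# Series in, data frame out: several aggregates per group become columns.
summary = df["value"].groupby(df["state"]).agg(["min", "max"])

print(type(totals).__name__)   # Series
print(type(summary).__name__)  # DataFrame
```

The type of the result follows from what each group produces (a scalar versus several labelled values), not from the type you started with.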
For your first data frame, the groups are of unequal size, so the function returns Series of different lengths with different indexes. The "combine" step cannot line these up as rows of a data frame, so it concatenates them into a multi-index Series instead. You can see this by adding a print statement to the function: the state == A group has length 2 and the B group has length 3.
import pandas as pd

def f(x):
    print(x)  # show the sub-frame passed in for each group
    return pd.Series(x['city'].values, index=range(len(x)))

s1 = df1.groupby('state').apply(f)
print(s1)
#   city state
# 0    v     A
# 1    w     A
#   city state
# 0    v     A
# 1    w     A
#   city state
# 2    x     B
# 3    y     B
# 4    z     B
# state
# A      0    v
#        1    w
# B      0    x
#        1    y
#        2    z
# dtype: object
However, you can manipulate the multi-index series outcome by resetting index and thereby adjusting its hierarchical levels:
df = df1.groupby('state').apply(f).reset_index()
print(df)
#   state  level_1  0
# 0     A        0  v
# 1     A        1  w
# 2     B        0  x
# 3     B        1  y
# 4     B        2  z
But more relevant to your needs is unstack(), which pivots a level of the index labels into columns, yielding a data frame. Consider fillna() to fill the None outcome.
df = df1.groupby('state').apply(f).unstack()
print(df)
#        0  1     2
# state
# A      v  w  None
# B      x  y     z
Instead of using index=range(len(x)) in your function f, you can use index=x.index to prevent this undesired behavior.