pandas groupby-apply behavior, returning a Series (inconsistent output type)

隐瞒了意图╮ 2021-02-05 09:39

I'm curious about the behavior of pandas groupby-apply when the apply function returns a Series.

When the series are of different lengths, it returns a multi-indexed se

2 Answers
  • 2021-02-05 10:27

    In essence, a DataFrame is a collection of equal-length Series (technically a dict-like container of Series objects). As stated in the pandas split-apply-combine docs, a groupby() operation involves one or more of the following steps:

    • Splitting the data into groups based on some criteria
    • Applying a function to each group independently
    • Combining the results into a data structure

    Notice that this does not say a DataFrame is always produced, only a generalized data structure. So a groupby() operation on a DataFrame can return a Series, and one on a Series can return a DataFrame.
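The three steps can be spelled out by hand. The following is only a rough sketch of the idea, not how pandas implements apply() internally; the sample data is an assumption reconstructed from the output printed later in this answer, and f is the function from this answer without its print call.

```python
import pandas as pd

# Assumed sample data, reconstructed from the output shown in this answer.
df1 = pd.DataFrame({
    "city": ["v", "w", "x", "y", "z"],
    "state": ["A", "A", "B", "B", "B"],
})

def f(x):
    # Same function as in the answer, minus the print call.
    return pd.Series(x["city"].values, index=range(len(x)))

# Split: iterating a GroupBy yields (key, sub-frame) pairs.
# Apply: call f on each sub-frame independently.
pieces = {key: f(group) for key, group in df1.groupby("state")}

# Combine: stack the per-group Series under their group keys.
combined = pd.concat(pieces, names=["state"])
print(combined)
```

Because the per-group Series have different lengths, stacking them under their keys is the natural way to combine them, which is why the outcome is a multi-indexed Series.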

    In your first dataframe the groups have unequal lengths (state A has 2 rows, state B has 3), so f returns Series with different indexes. The "combine" step cannot line up Series of different lengths as rows of a data frame, so it yields a multi-indexed Series instead. You can see the group sizes by adding a print statement to the function. (Group A is printed twice below because older pandas versions call the function twice on the first group.)

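    import pandas as pd

    # df1 is the first dataframe from the question
    def f(x):
        print(x)  # show each group as apply() passes it to f
        return pd.Series(x['city'].values, index=range(len(x)))

    s1 = df1.groupby('state').apply(f)

    print(s1)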
    #   city state
    # 0    v     A
    # 1    w     A
    #   city state
    # 0    v     A
    # 1    w     A
    #   city state
    # 2    x     B
    # 3    y     B
    # 4    z     B
    # state   
    # A      0    v
    #        1    w
    # B      0    x
    #        1    y
    #        2    z
    # dtype: object
    

    However, you can reshape the multi-indexed Series by resetting its index, which turns the index levels into ordinary columns:

    df = df1.groupby('state').apply(f).reset_index()
    print(df)
    
    #   state  level_1  0
    # 0     A        0  v
    # 1     A        1  w
    # 2     B        0  x
    # 3     B        1  y
    # 4     B        2  z
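If the default column names level_1 and 0 are not wanted, the levels can be named before resetting. This is a sketch only: s1 below is a hand-built stand-in for the Series shown earlier, and the labels "pos" and "city" are illustrative choices, not something from the question.

```python
import pandas as pd

# Hand-built stand-in for the multi-indexed Series produced by apply().
s1 = pd.Series(
    ["v", "w", "x", "y", "z"],
    index=pd.MultiIndex.from_tuples(
        [("A", 0), ("A", 1), ("B", 0), ("B", 1), ("B", 2)],
        names=["state", None],
    ),
)

# Name both index levels, then move them into columns; name= labels
# the column that holds the Series values.
tidy = s1.rename_axis(["state", "pos"]).reset_index(name="city")
print(tidy)
```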
    

    But more relevant to your needs is unstack(), which pivots the inner level of the index labels into columns and yields a data frame. Consider fillna() to replace the missing values (None below) that appear for the shorter groups.

    df = df1.groupby('state').apply(f).unstack()
    print(df)
    
    #        0  1     2
    # state            
    # A      v  w  None
    # B      x  y     z
    
  • 2021-02-05 10:27

    Instead of using index=range(len(x)) in your function f, you can use index=x.index to avoid this undesired behavior.
