Pandas GroupBy.apply method duplicates first group

忘掉有多难 2020-11-22 10:41

My first SO question: I am confused by the behavior of the apply method of groupby in pandas (0.12.0-4): it appears to apply the function TWICE to the first row of a data fr

3 Answers
  • 2020-11-22 11:10

    This "issue" has now been fixed: Upgrade to 0.25+

    Starting from v0.25, GroupBy.apply() will only evaluate the first group once. See GH24748.

    What’s new in 0.25.0 (July 18, 2019): Groupby.apply on DataFrame evaluates first group only once

    Relevant example from documentation:

    import pandas as pd

    pd.__version__
    # '0.25.0.dev0+590.g44d5498d8'

    df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

    def func(group):
        print(group.name)
        return group
    

    New behaviour (>=v0.25):

    df.groupby('a').apply(func)                                                                                            
    x
    y
    
       a  b
    0  x  1
    1  y  2
    

    Old behaviour (<=v0.24.x):

    df.groupby('a').apply(func)
    x
    x
    y
    
       a  b
    0  x  1
    1  y  2
    

    Pandas still uses the first group to determine whether apply can take a fast path or not. But at least it no longer has to evaluate the first group twice. Nice work, devs!

  • 2020-11-22 11:11

    This is by design, as described here and here

    The apply function needs to know the shape of the returned data to intelligently figure out how it will be combined. To do so, it calls the function (checkit in your case) twice.

    Depending on your actual use case, you can replace the call to apply with aggregate, transform or filter, as described in detail here. These functions require the return value to be a particular shape, and so don't call the function twice.
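    As a rough illustration of those three alternatives, here is a minimal sketch on a made-up frame (the data and the lambdas are assumptions for the example, not taken from the question). Which one fits depends on the shape of result you need: one value per group, one value per row, or a subset of the original rows.

```python
import pandas as pd

# Hypothetical example data, not from the original question
df = pd.DataFrame({"a": ["x", "x", "y"], "b": [1, 2, 5]})
by_a = df.groupby("a")["b"]

# aggregate: reduces each group to a single value
totals = by_a.agg("sum")

# transform: returns a result aligned with the original rows
centered = by_a.transform(lambda s: s - s.mean())

# filter: keeps or drops whole groups based on a boolean test
big_groups = df.groupby("a").filter(lambda g: g["b"].sum() > 3)

print(totals)
print(centered)
print(big_groups)
```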

    However, if the function you are calling does not have side effects, it most likely does not matter that it is called twice on the first group.
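    If side effects do matter, one way to check what your installed pandas version actually does is to count calls per group. This is only a sketch: the frame, the `counted` function and the `calls` counter are made up for illustration, and the counts you see depend on the pandas version.

```python
from collections import Counter

import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

calls = Counter()

def counted(group):
    # Record one call for this group's key, then return the group unchanged
    calls[group.name] += 1
    return group

# Select the value column explicitly so the function only sees "b"
df.groupby("a")[["b"]].apply(counted)

# A count above 1 for the first group means it was evaluated more than once
print(dict(calls))
```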

  • 2020-11-22 11:19

    You can use a for loop over the groups to avoid groupby.apply evaluating the first group twice.

    log_sample.csv

    guestid,keyword
    1,null
    2,null
    2,null
    3,null
    3,null
    3,null
    4,null
    4,null
    4,null
    4,null
    

    My code snippet:

    import pandas as pd

    df = pd.read_csv("log_sample.csv")
    grouped = df.groupby("guestid")

    # Iterating over the groupby object visits each group exactly once
    for guestid, df_group in grouped:
        print(list(df_group["guestid"]))

    df.head(100)
    

    Output:

    [1]
    [2, 2]
    [3, 3, 3]
    [4, 4, 4, 4]
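    The loop above only prints. If you need a combined result, as apply would give you, one possible sketch is to collect a value per group and assemble it afterwards. The CSV is inlined here so the example is self-contained, and the names `rows_per_guest` and `visited` are made up for illustration.

```python
import io

import pandas as pd

# Inline copy of log_sample.csv so the sketch runs without the file
csv_text = """guestid,keyword
1,null
2,null
2,null
3,null
3,null
3,null
4,null
4,null
4,null
4,null
"""
df = pd.read_csv(io.StringIO(csv_text))

visited = []          # group keys in the order the loop sees them
rows_per_guest = {}   # one computed value per group

for guestid, df_group in df.groupby("guestid"):
    visited.append(guestid)
    rows_per_guest[guestid] = len(df_group)

# Assemble the per-group values into a single Series
result = pd.Series(rows_per_guest, name="rows")
print(result)
```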
    