What is the difference between groupby.first, groupby.nth, and groupby.head when as_index=False?

孤城傲影 2020-12-21 07:01

Edit: The rookie mistake I made, storing np.nan as a string, was pointed out by @coldspeed, @wen-ben, and @ALollz. The answers are quite good, so I don't d

2 Answers
  • 2020-12-21 07:40

    Here is the difference. You need to convert 'np.nan' to a real NaN: in your original df it is a string. Once you convert it, you will see the difference:

    df = df.mask(df == 'np.nan')  # turn the literal string 'np.nan' into a real NaN
    df.groupby('A', as_index=False).head(1)  # same result as df.groupby('A', as_index=False).nth(0)
    
    Out[8]: 
       A    B
    0  1  NaN
    3  2    8
    df.groupby('A', as_index=False).first()
    # first() resets the index because it may pick values from different rows
    # within a group: when the first item is NaN, it skips ahead to the first
    # non-null value instead of staying on the same row. Keeping the original
    # row index would therefore be misleading.
    Out[9]: 
       A  B
    0  1  4
    1  2  8
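
To check whether a column holds the string 'np.nan' rather than a real null, you can compare the null count before and after the conversion. The sketch below uses a small hypothetical frame (the question's original df is not shown here), with values chosen only to mirror the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first B value of group 1 is the *string* 'np.nan'.
df = pd.DataFrame({'A': [1, 1, 1, 2, 2], 'B': ['np.nan', 4, 5, 8, 9]})

nulls_before = int(df['B'].isna().sum())   # the string does not count as null

cleaned = df.mask(df == 'np.nan')          # replace the string with a real NaN
nulls_after = int(cleaned['B'].isna().sum())

head_rows = cleaned.groupby('A', as_index=False).head(1)    # keeps whole rows
first_rows = cleaned.groupby('A', as_index=False).first()   # skips NaN per column

print(nulls_before, nulls_after)
print(head_rows)
print(first_rows)
```

If the two null counts are equal, the column most likely never contained the string in the first place.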
    
  • 2020-12-21 08:03

    The major issue is that you likely have the string 'np.nan' stored rather than a real null value. Here is how the three methods handle null values differently:

    Sample Data:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'B': [None, '1', np.nan, '2', 3, 4]})
    

    first

    This returns the first non-null value within each group. Oddly enough, in the pandas version used here it does not skip None by default, though passing the kwarg dropna=True makes it do so. Because the lookup is done per column, the result may combine values that originally came from different rows:

    df.groupby('A', as_index=False).first()
    #   A     B
    #0  1  None
    #1  2     2
    #2  3     3
    
    df.groupby('A', as_index=False).first(dropna=True)
    #   A  B
    #0  1  1
    #1  2  2
    #2  3  3
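
Because first() picks the first non-null value column by column, one output row can combine values from several input rows. A minimal sketch with a hypothetical frame (the extra column C is illustrative and not part of the sample data above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: in group 1, B is null in row 0 and C is null in row 1.
df2 = pd.DataFrame({'A': [1, 1], 'B': [np.nan, 2.0], 'C': [10.0, np.nan]})

first_rows = df2.groupby('A', as_index=False).first()
head_rows = df2.groupby('A', as_index=False).head(1)

# first(): B is taken from row 1 and C from row 0.
print(first_rows)
# head(1): the untouched row 0, NaN included.
print(head_rows)
```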
    

    head(n)

    Returns the top n rows within a group. Values remain bound within rows. If you give it an n that is more than the number of rows, it returns all rows in that group without complaining:

    df.groupby('A', as_index=False).head(1)
    #   A     B
    #0  1  None
    #2  2   NaN
    #4  3     3
    
    df.groupby('A', as_index=False).head(200)
    #   A     B
    #0  1  None
    #1  1     1
    #2  2   NaN
    #3  2     2
    #4  3     3
    #5  3     4
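
Since head() only filters rows, it keeps the original row labels, and the as_index flag should make no difference to its result. A quick check on a hypothetical numeric frame (the data is made up for illustration):

```python
import pandas as pd

# Hypothetical data with three groups of two rows each.
df3 = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3], 'B': [10, 11, 20, 21, 30, 31]})

with_flag = df3.groupby('A', as_index=False).head(1)
without_flag = df3.groupby('A').head(1)
oversized = df3.groupby('A', as_index=False).head(200)

print(with_flag.equals(without_flag))  # compares rows and the original index
print(oversized.equals(df3))           # n larger than any group size
```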
    

    nth

    This takes the nth row, so again values remain bound within the row. .nth(0) is the same as .head(1), though they have different uses. For instance, if you need the 0th and 2nd row, that's difficult to do with .head(), but easy with .nth([0, 2]). Conversely, it's far easier to write .head(10) than .nth(list(range(10))).

    df.groupby('A', as_index=False).nth(0)
    #   A     B
    #0  1  None
    #2  2   NaN
    #4  3     3
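
The relationship between nth and head can be checked directly. The sketch below assumes a hypothetical frame with three rows per group, so that a 2nd row exists to select; with as_index=False, both methods are expected to return the selected rows under their original labels.

```python
import pandas as pd

# Hypothetical data: two groups of three rows each.
df4 = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [10, 11, 12, 20, 21, 22]})
g = df4.groupby('A', as_index=False)

same_first_row = g.nth(0).equals(g.head(1))     # nth(0) vs head(1)
same_top_two = g.nth([0, 1]).equals(g.head(2))  # nth(list) vs head(n)
picked = g.nth([0, 2])                          # 0th and 2nd row of each group

print(same_first_row, same_top_two)
print(picked)
```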
    

    nth also supports dropping rows that contain any null values, so, unlike .head(), it can return the first row of each group that has no nulls:

    df.groupby('A', as_index=False).nth(0, dropna='any')
    #   A  B
    #A      
    #1  1  1
    #2  2  2
    #3  3  3
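
The shape of the nth(..., dropna=...) result may differ between pandas versions (note the extra index in the output above). One way to select the same rows is to drop the null rows first and then take the first remaining row per group. This is a substitute for the call above under that assumption, not the same API:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2, 3, 3],
                   'B': [None, '1', np.nan, '2', 3, 4]})

# Drop rows containing any null, then keep the first remaining row per group.
first_complete = df.dropna(how='any').groupby('A', as_index=False).head(1)
print(first_complete)
```

Here the original row labels are kept, which makes it easy to see which input row each result came from.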
    