What is the difference between size and count in pandas?

误落风尘 2020-11-22 04:37

What is the difference between groupby("x").count and groupby("x").size in pandas?

Does size just exclude nil?

5 answers
  • 2020-11-22 04:59

    size includes NaN values, count does not:

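    In [46]:
    import numpy as np
    import pandas as pd

    # np.nan is used here because the np.NaN alias was removed in NumPy 2.0
    df = pd.DataFrame({'a':[0,0,1,2,2,2], 'b':[1,2,3,4,np.nan,4], 'c':np.random.randn(6)})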
    df
    
    Out[46]:
       a   b         c
    0  0   1  1.067627
    1  0   2  0.554691
    2  1   3  0.458084
    3  2   4  0.426635
    4  2 NaN -2.238091
    5  2   4  1.256943
    
    In [48]:
    print(df.groupby(['a'])['b'].count())
    print(df.groupby(['a'])['b'].size())
    
    a
    0    2
    1    1
    2    2
    Name: b, dtype: int64
    
    a
    0    2
    1    1
    2    3
    dtype: int64 
    
  • 2020-11-22 05:13

    With a plain DataFrame, the main difference is the handling of NaN values: count() does not include NaN values when counting, while size does.

    With groupby, however, count() is computed per column, so to get the number of rows in each group you have to select a column (one without NaN values) after the groupby, whereas size() needs no such column selection.
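    The point above can be sketched with a small made-up DataFrame. The column names key, x, and y are assumptions for illustration only: x contains one NaN, y contains none, so count() on y happens to match size() while count() on x does not.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column "x" has one missing value, column "y" has none.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b"],
    "x": [1.0, np.nan, 3.0, 4.0, 5.0],
    "y": [10, 20, 30, 40, 50],
})

# Plain DataFrame: len() counts rows, count() counts non-NaN values per column.
n_rows = len(df)
non_null = df.count()

# GroupBy: size() needs no column and counts every row in each group.
group_sizes = df.groupby("key").size()

# GroupBy: count() is per column, so the result depends on the column chosen.
count_x = df.groupby("key")["x"].count()  # the NaN in "x" is skipped
count_y = df.groupby("key")["y"].count()  # "y" has no NaN, so it matches size()

print(n_rows)
print(non_null)
print(group_sizes)
print(count_x)
print(count_y)
```

    If the goal is simply the number of rows per group, size() avoids having to think about which column is free of NaN values.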

  • 2020-11-22 05:19

    What is the difference between size and count in pandas?

    The other answers have pointed out the difference. However, it is not completely accurate to say "size counts NaNs while count does not". While size does indeed count NaNs, this is a consequence of the fact that size returns the size (or the length) of the object it is called on. Naturally, this also includes rows/values which are NaN.

    So, to summarize, size returns the size of the Series/DataFrame [1],

    df = pd.DataFrame({'A': ['x', 'y', np.nan, 'z']})
    df
    
         A
    0    x
    1    y
    2  NaN
    3    z
    

    df.A.size
    # 4
    

    ...while count counts the non-NaN values:

    df.A.count()
    # 3 
    

    Notice that size is an attribute (for a Series, it gives the same result as len(df.A)), whereas count is a method.

    [1] DataFrame.size is also an attribute and returns the number of elements in the DataFrame (rows x columns).
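    The footnote can be checked with a short sketch. The two-column frame below is an assumption made for illustration (the example above has a single column, where len(df) and df.size happen to coincide).

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame, each column containing one NaN.
df = pd.DataFrame({
    "A": ["x", "y", np.nan, "z"],
    "B": [1.0, np.nan, 3.0, 4.0],
})

# For a Series, size is an attribute and equals len().
series_size = df["A"].size
series_len = len(df["A"])

# For a DataFrame, size is rows x columns, which differs from len(df)
# (the number of rows) as soon as there is more than one column.
frame_size = df.size
n_rows, n_cols = df.shape

# count() is a method and skips NaN.
series_count = df["A"].count()

print(series_size, series_len, series_count)
print(frame_size, n_rows, n_cols)
```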


    Behaviour with GroupBy - Output Structure

    Besides the basic difference, there's also the difference in the structure of the generated output when calling GroupBy.size() vs GroupBy.count().

    df = pd.DataFrame({'A': list('aaabbccc'), 'B': ['x', 'x', np.nan, np.nan, np.nan, np.nan, 'x', 'x']})
    df
       A    B
    0  a    x
    1  a    x
    2  a  NaN
    3  b  NaN
    4  b  NaN
    5  c  NaN
    6  c    x
    7  c    x
    

    Consider,

    df.groupby('A').size()
    
    A
    a    3
    b    2
    c    3
    dtype: int64
    

    Versus,

    df.groupby('A').count()
    
       B
    A   
    a  2
    b  0
    c  2
    

    GroupBy.count returns a DataFrame when you call count on all columns, while GroupBy.size returns a Series.

    The reason is that size is the same for all columns, so only a single result is returned. Meanwhile, count is computed for each column, as the results depend on how many NaNs each column has.
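    To make the per-column behaviour visible, here is a sketch that extends the example with a second value column. The column C and its values are hypothetical, chosen so that its NaN pattern differs from B.

```python
import numpy as np
import pandas as pd

# Same layout as the example above, plus a hypothetical column "C"
# with a different NaN pattern than "B".
df = pd.DataFrame({
    "A": list("aaabbccc"),
    "B": ["x", "x", np.nan, np.nan, np.nan, np.nan, "x", "x"],
    "C": [1.0, np.nan, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
})

sizes = df.groupby("A").size()    # Series: one number per group
counts = df.groupby("A").count()  # DataFrame: one column per non-key column

print(type(sizes).__name__)
print(type(counts).__name__)
print(counts)
```

    Each column of counts reflects its own missing values, which is why a single Series could not represent the result.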


    Behavior with pivot_table

    Another example is how pivot_table treats this data. Suppose we would like to compute the cross tabulation of

    df
    
       A  B
    0  0  1
    1  0  1
    2  1  2
    3  0  2
    4  0  0
    
    pd.crosstab(df.A, df.B)  # Result we expect, but with `pivot_table`.
    
    B  0  1  2
    A         
    0  1  2  1
    1  0  0  1
    

    With pivot_table, you can issue size:

    df.pivot_table(index='A', columns='B', aggfunc='size', fill_value=0)
    
    B  0  1  2
    A         
    0  1  2  1
    1  0  0  1
    

    But count does not work; an empty DataFrame is returned:

    df.pivot_table(index='A', columns='B', aggfunc='count')
    
    Empty DataFrame
    Columns: []
    Index: [0, 1]
    

    I believe the reason for this is that 'count' must be done on the series that is passed to the values argument, and when nothing is passed, pandas decides to make no assumptions.
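    A sketch consistent with that explanation: if a column is supplied through values, 'count' has a series to work on. The helper column C below is hypothetical and exists only to give count something to count; this does not confirm how pandas reasons internally.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 1, 0, 0], "B": [1, 1, 2, 2, 0]})

# Hypothetical helper column with no missing values, used only as the
# series that 'count' operates on.
with_values = df.assign(C=1)

counted = with_values.pivot_table(
    index="A", columns="B", values="C", aggfunc="count", fill_value=0
)
expected = pd.crosstab(df.A, df.B)

print(counted)
# Compare the cell values with the crosstab result (dtypes may differ).
print((counted.to_numpy() == expected.to_numpy()).all())
```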

  • 2020-11-22 05:22

    Just to add a little to @Edchum's answer: even if the data has no NA values, the result of count() is more verbose. Using the earlier example:

    grouped = df.groupby('a')
    grouped.count()
    Out[197]: 
       b  c
    a      
    0  2  2
    1  1  1
    2  2  3
    grouped.size()
    Out[198]: 
    a
    0    2
    1    1
    2    3
    dtype: int64
    
  • 2020-11-22 05:23

    In addition to all the answers above, I would like to point out one more difference which seems significant to me.

    You can loosely compare a pandas DataFrame's size and count with a Java Vector's capacity and size. A Vector is created with some preallocated capacity, and more memory is allocated as the number of elements approaches it. The analogy is only approximate, though: size does not report allocated memory.

    The size attribute gives the total number of cells in the DataFrame (rows x columns), whether or not they hold a value, whereas count() gives the number of elements that are actually present (non-NaN) in each column. For example, a DataFrame with 3 rows and 2 columns has a size of 6, even though it has only 3 rows.
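    The original illustration for this answer is not available, so here is a minimal stand-in. The column names and values are made up; the only property taken from the answer is a DataFrame with 3 rows whose size is 6.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-row, 2-column frame with one missing value.
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.0, np.nan, 3.0]})

n_rows = len(df)         # number of rows
n_cells = df.size        # rows x columns
per_column = df.count()  # non-NaN values in each column

print(n_rows, n_cells)
print(per_column)
```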

    This answer covers the difference between size and count for a DataFrame, not a pandas Series. I have not checked what happens with a Series.
