Count rows by a certain combination of row values in pandas

Backend · unresolved · 3 answers · 880 views
抹茶落季 · 2020-12-21 06:52

I have a dataframe (df) like this:

  v1   v2        v3
   0  -30 -15
   0  -30 -7.5
   0  -30 -11.25
   0  -30 -13.125
   0  -30 -14.0625
   0  -30 -13.59375

I want to keep only the rows whose (v1, v2) combination occurs exactly six times in the dataframe. How can I count rows by that combination of values and filter on the count?

3 Answers
  • 2020-12-21 07:34

    Use groupby + count/size:

    df.groupby(['v1', 'v2']).v3.transform('count')
    
    0     6.0
    1     6.0
    2     6.0
    3     6.0
    4     6.0
    5     6.0
    6     6.0
    7     6.0
    8     6.0
    9     6.0
    10    6.0
    11    6.0
    12    3.0
    13    3.0
    14    3.0
    Name: v3, dtype: float64
    

    Use this count as a mask to filter df:

    df = df[df.groupby(['v1', 'v2']).v3.transform('count').eq(6)]    # == 6
    df
    
        v1  v2        v3
    0    0 -30 -15.00000
    1    0 -30  -7.50000
    2    0 -30 -11.25000
    3    0 -30 -13.12500
    4    0 -30 -14.06250
    5    0 -30 -13.59375
    6    0 -10  -5.00000
    7    0 -10  -7.50000
    8    0 -10  -6.25000
    9    0 -10  -5.62500
    10   0 -10  -5.93750
    11   0 -10  -6.09375
    

    count does not count NaNs, while size does. Use whichever is appropriate for you.
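
    To see the difference concretely, here is a minimal sketch on a tiny made-up frame (the column names `g` and `x` are invented for the demo) with a NaN in the aggregated column:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: group 'a' has two rows, one of them NaN in x
df = pd.DataFrame({'g': ['a', 'a', 'b'],
                   'x': [1.0, np.nan, 2.0]})

# 'size' counts every row in the group, NaN or not
print(df.groupby('g')['x'].transform('size'))    # 2, 2, 1

# 'count' counts only the non-NaN values of x per group
print(df.groupby('g')['x'].transform('count'))   # 1, 1, 1
```

    So filtering with size keeps NaN rows in the tally, while count silently drops them from it; pick whichever matches what "6 rows" should mean for your data.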

  • 2020-12-21 07:37

    I think there are no NaNs in v1 and v2, so use transform + size:

    df = df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
    print (df)
        v1  v2        v3
    0    0 -30 -15.00000
    1    0 -30  -7.50000
    2    0 -30 -11.25000
    3    0 -30 -13.12500
    4    0 -30 -14.06250
    5    0 -30 -13.59375
    6    0 -10  -5.00000
    7    0 -10  -7.50000
    8    0 -10  -6.25000
    9    0 -10  -5.62500
    10   0 -10  -5.93750
    11   0 -10  -6.09375
    

    Detail:

    print (df.groupby(['v1', 'v2'])['v2'].transform('size') == 6)
    0      True
    1      True
    2      True
    3      True
    4      True
    5      True
    6      True
    7      True
    8      True
    9      True
    10     True
    11     True
    12    False
    13    False
    14    False
    Name: v2, dtype: bool
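
    The same per-pair counts can also be merged back onto the frame explicitly, without a transform mask. A sketch on a made-up frame with the question's shape (the helper column name `n` is invented here):

```python
import pandas as pd

# Made-up frame: one (v1, v2) pair occurring six times, one three times
df = pd.DataFrame({'v1': [0] * 9,
                   'v2': [-30] * 6 + [-10] * 3,
                   'v3': range(9)})

# One count per (v1, v2) pair, merged back onto every row
counts = df.groupby(['v1', 'v2']).size().reset_index(name='n')
out = df.merge(counts, on=['v1', 'v2'])
result = out[out['n'] == 6].drop(columns='n')
print(result)
```

    Note that merge resets the index, whereas the transform mask keeps the original one, so the mask approach is usually the better fit when you need to align back to the source frame.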
    

    Unfortunately filter is really slow, so if you need better performance, use transform:

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    N = 1000000
    L = list('abcdefghijkl') 
    df = pd.DataFrame({'v1': np.random.choice(L, N),
                       'v2':np.random.randint(10000,size=N),
                       'value':np.random.randint(1000,size=N),
                       'value2':np.random.randint(5000,size=N)})
    df = df.sort_values(['v1','v2']).reset_index(drop=True)
    print (df.head(10))
    
    In [290]: %timeit df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
    1 loop, best of 3: 12.1 s per loop
    
    In [291]: %timeit df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 6]
    1 loop, best of 3: 176 ms per loop
    
    In [292]: %timeit df[df.groupby(['v1', 'v2']).v2.transform('count').eq(6)]
    10 loops, best of 3: 175 ms per loop
    

    N = 1000000
    
    ngroups = 1000
    
    df = pd.DataFrame(dict(A = np.random.randint(0,ngroups,size=N),B=np.random.randn(N)))
    
    In [299]: %timeit df.groupby('A').filter(lambda x: len(x) > 1000)
    1 loop, best of 3: 330 ms per loop
    
    In [300]: %timeit df[df.groupby(['A'])['A'].transform('size') > 1000]
    10 loops, best of 3: 101 ms per loop
    

    Caveat

    These timings do not explore how the number of groups affects performance, which matters a lot for some of these solutions.

  • 2020-12-21 07:51

    You can use the groupby filter method:

    In [11]: df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 6)
    Out[11]:
        v1  v2        v3
    0    0 -30 -15.00000
    1    0 -30  -7.50000
    2    0 -30 -11.25000
    3    0 -30 -13.12500
    4    0 -30 -14.06250
    5    0 -30 -13.59375
    6    0 -10  -5.00000
    7    0 -10  -7.50000
    8    0 -10  -6.25000
    9    0 -10  -5.62500
    10   0 -10  -5.93750
    11   0 -10  -6.09375
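
    filter evaluates its Python lambda once per group, which is why the benchmarks in another answer show it lagging transform when there are many groups; both approaches select exactly the same rows. A quick check on a made-up frame:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'v1': rng.integers(0, 5, 50),
                   'v2': rng.integers(0, 5, 50)})

# filter: the Python lambda runs once per (v1, v2) group
a = df.groupby(['v1', 'v2']).filter(lambda x: len(x) == 2)

# transform: one vectorized size computation plus a boolean mask
b = df[df.groupby(['v1', 'v2'])['v2'].transform('size') == 2]

print(a.equals(b))  # both keep the same rows, in original order
```

    So filter is the more readable choice for arbitrary group predicates, and transform the faster one when the predicate is a simple count.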
    