Pandas: Selecting rows for which groupby.sum() satisfies condition

轻奢々 2021-01-19 05:49

In pandas I have a dataframe of the form:

>>> import pandas as pd
>>> df = pd.DataFrame({'ID':[51,51,51,24,24,24,31], 'x':[0,1,0,0,1,


        
3 Answers
  • 2021-01-19 05:53

    Use groupby and filter

    df.groupby('ID').filter(lambda s: s.x.sum()>=2)
    

    Output:

       ID  x
    3  24  0
    4  24  1
    5  24  1
    
  • 2021-01-19 06:04
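    import pandas as pd
    df = pd.DataFrame({'ID':[51,51,51,24,24,24,31], 'x':[0,1,0,0,1,1,0]})
    # transform broadcasts each group's sum back onto its rows; passing the string 'sum' gives the same result here
    df.loc[df.groupby(['ID'])['x'].transform(func=sum)>=2,:]
    Output: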
       ID  x
    3  24  0
    4  24  1
    5  24  1
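As a rough illustration of why the `transform` approach works, this sketch splits the one-liner into its intermediate steps. The variable names `row_sums`, `mask` and `selected` are made up for the example, and it uses the string `'sum'` rather than the builtin.

```python
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# transform('sum') returns a Series aligned with df's index in which every
# row carries the sum of its own group.
row_sums = df.groupby('ID')['x'].transform('sum')
print(row_sums.tolist())      # [1, 1, 1, 2, 2, 2, 0]

# Comparing it to the threshold gives an ordinary boolean row mask.
mask = row_sums >= 2
selected = df.loc[mask]
print(selected.index.tolist())  # [3, 4, 5]
```

Because the mask has one entry per row of `df`, it can be combined with other row-level conditions using `&` and `|` before indexing.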
    
  • Using np.bincount and pd.factorize
    An alternative, more advanced technique that gives better performance

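    import numpy as np

    f, u = df.ID.factorize()  # f: integer group code per row, u: the unique IDs
    df[np.bincount(f, df.x.values)[f] >= 2]  # weighted bincount = sum of x per code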
    
       ID  x
    3  24  0
    4  24  1
    5  24  1
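For readers unfamiliar with the two functions, here is a small sketch of the intermediate arrays, assuming the same seven-row frame. The names `codes`, `uniques` and `sums` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# factorize assigns each distinct ID an integer code in order of first appearance.
codes, uniques = pd.factorize(df['ID'])
print(codes.tolist())    # [0, 0, 0, 1, 1, 1, 2]
print(list(uniques))     # [51, 24, 31]

# With weights, bincount adds up the weights that fall on each code,
# i.e. it computes the per-group sum of x (as floats).
sums = np.bincount(codes, weights=df['x'].to_numpy())
print(sums.tolist())     # [1.0, 2.0, 0.0]

# Indexing the sums by the codes broadcasts them back to one value per row.
mask = sums[codes] >= 2
selected = df[mask]
print(selected.index.tolist())  # [3, 4, 5]
```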
    

    In obnoxious one-liner form

    df[(lambda f, w: np.bincount(f, w)[f] >= 2)(df.ID.factorize()[0], df.x.values)]
    
       ID  x
    3  24  0
    4  24  1
    5  24  1
    

    np.bincount and np.unique
    I could have used np.unique with the return_inverse parameter to accomplish exactly the same thing. However, np.unique sorts the array, which changes the time complexity of the solution: the sort is O(n log n), whereas the hash-based factorize is roughly linear.

    u, f = np.unique(df.ID.values, return_inverse=True)
    df[np.bincount(f, df.x.values)[f] >= 2]
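A short sketch of how the `np.unique` variant differs, again on the seven-row frame: the codes are assigned in sorted order rather than in order of appearance, but the resulting mask is the same. The names `u`, `inv` and `sums` are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [51, 51, 51, 24, 24, 24, 31],
                   'x':  [0, 1, 0, 0, 1, 1, 0]})

# np.unique returns the sorted unique values; return_inverse gives, for each
# row, the position of its ID within that sorted array.
u, inv = np.unique(df['ID'].to_numpy(), return_inverse=True)
print(u.tolist())     # [24, 31, 51]
print(inv.tolist())   # [2, 2, 2, 0, 0, 0, 1]

# The per-group sums come out in sorted-ID order this time.
sums = np.bincount(inv, weights=df['x'].to_numpy())
print(sums.tolist())  # [2.0, 0.0, 1.0]

selected = df[sums[inv] >= 2]
print(selected.index.tolist())  # [3, 4, 5]
```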
    

    One-liner

    df[(lambda f, w: np.bincount(f, w)[f] >= 2)(np.unique(df.ID.values, return_inverse=True)[1], df.x.values)]
    

    Timing

    %timeit df[(lambda f, w: np.bincount(f, w)[f] >= 2)(df.ID.factorize()[0], df.x.values)]
    %timeit df[(lambda f, w: np.bincount(f, w)[f] >= 2)(np.unique(df.ID.values, return_inverse=True)[1], df.x.values)]
    %timeit df.groupby('ID').filter(lambda s: s.x.sum()>=2)
    %timeit df.loc[df.groupby(['ID'])['x'].transform(func=sum)>=2]
    %timeit df.loc[df.groupby(['ID'])['x'].transform('sum')>=2]
    

    small data (timings are listed in the same order as the %timeit lines above)

    1000 loops, best of 3: 302 µs per loop
    1000 loops, best of 3: 241 µs per loop
    1000 loops, best of 3: 1.52 ms per loop
    1000 loops, best of 3: 1.2 ms per loop
    1000 loops, best of 3: 1.21 ms per loop
    

    large data

    np.random.seed([3,1415])
    df = pd.DataFrame(dict(
            ID=np.random.randint(100, size=10000),
            x=np.random.randint(2, size=10000)
        ))
    
    1000 loops, best of 3: 528 µs per loop
    1000 loops, best of 3: 847 µs per loop
    10 loops, best of 3: 20.9 ms per loop
    1000 loops, best of 3: 1.47 ms per loop
    1000 loops, best of 3: 1.55 ms per loop
    

    larger data

    np.random.seed([3,1415])
    df = pd.DataFrame(dict(
            ID=np.random.randint(100, size=100000),
            x=np.random.randint(2, size=100000)
        ))
    
    1000 loops, best of 3: 2.01 ms per loop
    100 loops, best of 3: 6.44 ms per loop
    10 loops, best of 3: 29.4 ms per loop
    100 loops, best of 3: 3.84 ms per loop
    100 loops, best of 3: 3.74 ms per loop
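If you want to rerun the comparison outside IPython, something along these lines should work. It is only a sketch: the `via_*` function names are hypothetical wrappers around the expressions timed above, it uses the standard `timeit` module instead of `%timeit`, and the numbers will depend on your machine and library versions.

```python
import timeit
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    ID=np.random.randint(100, size=10000),
    x=np.random.randint(2, size=10000),
))

def via_factorize(d):
    f = d['ID'].factorize()[0]
    return d[np.bincount(f, d['x'].to_numpy())[f] >= 2]

def via_unique(d):
    f = np.unique(d['ID'].to_numpy(), return_inverse=True)[1]
    return d[np.bincount(f, d['x'].to_numpy())[f] >= 2]

def via_filter(d):
    return d.groupby('ID').filter(lambda g: g['x'].sum() >= 2)

def via_transform(d):
    return d.loc[d.groupby('ID')['x'].transform('sum') >= 2]

candidates = [via_factorize, via_unique, via_filter, via_transform]

# Sanity check first: every approach should select exactly the same rows.
reference = via_transform(df)
all_equal = all(fn(df).equals(reference) for fn in candidates)
print("all approaches agree:", all_equal)

# Best of 3 repeats, 10 calls each, reported per call.
for fn in candidates:
    seconds = min(timeit.repeat(lambda: fn(df), number=10, repeat=3)) / 10
    print(f"{fn.__name__:15s} {seconds * 1e3:8.3f} ms per call")
```

Checking that the results agree before timing is worth the extra line, since a faster variant that selects different rows would not be a fair comparison.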
    