Drop rows corresponding to groups smaller than specified size

滥情空心 2021-01-22 20:20

I have a DataFrame of answers for 100 question_id's and 50 user_id's. Each row represents a single question answered by a specific user. The task is to drop all rows belonging to users with fewer answers than a specified group size.

2 Answers
  •  北海茫月
    2021-01-22 20:51

    Use boolean indexing to filter only the rows whose group appears more than 100 times; transform with 'size' returns a Series with the same length as the original DataFrame:

    df1 = df[df.groupby('user_id')['question_id'].transform('size') > 100]
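
    For intuition, here is a minimal toy sketch (assuming pandas is imported as pd; the data and the threshold of 2 are made up for illustration) of what transform('size') broadcasts back to each row:

    import pandas as pd

    toy = pd.DataFrame({'user_id': [1, 1, 1, 2, 3, 3],
                        'question_id': ['a', 'b', 'c', 'a', 'a', 'b']})
    # each row gets its group's row count: [3, 3, 3, 1, 2, 2]
    sizes = toy.groupby('user_id')['question_id'].transform('size')
    # keep only rows of users with more than 2 answers
    print(toy[sizes > 2])
    #    user_id question_id
    # 0        1           a
    # 1        1           b
    # 2        1           c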
    

    Performance: it depends on the number of rows and the size of the groups, so it is best to test on real data:

    import numpy as np
    import pandas as pd

    np.random.seed(123)
    N = 1000000
    L = list('abcde') 
    df = pd.DataFrame({'question_id': np.random.choice(L, N, p=(.75,.0001,.0005,.0005,.2489)),
                       'user_id':np.random.randint(10000,size=N)})
    df = df.sort_values(['user_id','question_id']).reset_index(drop=True)
    
    In [176]: %timeit df[df.groupby('user_id')['question_id'].transform('size') > 100]
    74.8 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    # coldspeed's solutions
    In [177]: %timeit df.groupby('user_id').filter(lambda x: len(x) > 100)
    1.4 s ± 44.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [178]: %%timeit
         ...: m = dict(zip(*np.unique(df.user_id, return_counts=True)))
         ...: df[df['user_id'].map(m) > 100]
         ...: 
    89.2 ms ± 3.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
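
    If the minimum group size varies, the transform approach can be wrapped in a tiny helper; the name filter_small_groups and the min_size parameter below are just illustrative, not part of the original answer:

    def filter_small_groups(df, by, min_size):
        # drop rows whose group in column `by` has fewer than `min_size` rows
        return df[df.groupby(by)[by].transform('size') >= min_size]

    # same result as the > 100 filter above (counts are integers, so > 100 == >= 101)
    df1 = filter_small_groups(df, 'user_id', 101)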
    
