Assign control vs. treatment groupings randomly based on % for more than 2 groups

我只是一个虾纸丫 submitted on 2019-12-01 08:09:40

Question


Piggybacking off my own previous question, "python pandas: assign control vs. treatment groupings randomly based on %".

Thanks to @maxU, I know how to assign random control/treatment groupings to 2 groups; but what if I have 3 groups or more?

For example:

df.head()

customer_id | Group | many other columns
ABC             1
CDE             3
BHF             2
NID             1
WKL             3
SDI             2
JSK             1
OSM             3
MPA             2
MAD             1

pd.pivot_table(df,index=['Group'],values=["customer_id"],aggfunc=lambda x: len(x.unique()))

Group 1  : 270
Group 2  : 180
Group 3  : 330

I have a great answer for when I only have two groups:

df['Flag'] = df.groupby('Group')['customer_id']\
             .transform(lambda x: np.random.choice(['Control','Test'], len(x), 
                                                  p=[.5,.5] if x.name==1 else [.4,.6]))

But what if I want to split it this way:

  • Group 1: 50% Control & 50% Test
  • Group 2: 40% Control & 60% Test
  • Group 3: 20% Control & 80% Test

@MaxU's answer is great, but unfortunately the split is not exact:

d = {1:[.5,.5], 2:[.4,.6], 3:[.2,.8]}

df['Flag'] = df.groupby('Group')['customer_id'] \
             .transform(lambda x: np.random.choice(['Control','Test'], len(x), p=d[x.name]))

When I test it, I don't get exact splits.

pd.pivot_table(df,index=['Group'],values=["customer_id"],columns=['Flag'], aggfunc=lambda x: len(x.unique()))

           Control  Treatment
Group 1:    138       132
Group 2:    78        102
Group 3:    79        251

Group 1 should be 135/135.


Answer 1:


It sounds like you're looking for a way to split your customer_ids into exact proportions rather than relying on chance. Here's one way to do that using pandas.qcut and np.random.permutation.

In [228]: df = pd.DataFrame({'customer_id': np.random.normal(size=10000), 
                             'group': np.random.choice(['a', 'b', 'c'], size=10000)})

In [229]: proportions = {'a':[.5,.5], 'b':[.4,.6], 'c':[.2,.8]}

In [230]: df.head()
Out[230]:
   customer_id group
0       0.6547     c
1       1.4190     a
2       0.4205     a
3       2.3266     a
4      -0.5691     b

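In [231]: def assigner(gp):
     ...:     # Look up this group's [control, treatment] proportions.
     ...:     group = gp['group'].iloc[0]
     ...:     # Cut the row positions 0..n-1 at the cumulative proportions:
     ...:     # 0 = control, 1 = treatment. np.asarray replaces the old
     ...:     # .get_values() call, which current pandas no longer provides.
     ...:     cut = np.asarray(pd.qcut(
     ...:         np.arange(gp.shape[0]),
     ...:         q=np.cumsum([0] + proportions[group]),
     ...:         labels=range(len(proportions[group]))
     ...:     ))
     ...:     # Shuffle the labels: random assignment, fixed counts.
     ...:     return pd.Series(cut[np.random.permutation(gp.shape[0])], index=gp.index, name='assignment')
     ...: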

In [232]: df['assignment'] = df.groupby('group', group_keys=False).apply(assigner)

In [233]: df.head()
Out[233]:
   customer_id group  assignment
0       0.6547     c           1
1       1.4190     a           1
2       0.4205     a           0
3       2.3266     a           1
4      -0.5691     b           0

In [234]: (df.groupby(['group', 'assignment'])
             .size()
             .unstack()
             .assign(proportion=lambda x: x[0] / (x[0] + x[1])))
Out[234]:
assignment     0     1  proportion
group
a           1659  1658      0.5002
b           1335  2003      0.3999
c            669  2676      0.2000

What's going on here?

  1. Within each group, we call the function assigner.
  2. assigner looks up the group's proportions in the predefined dictionary and calls pd.qcut to split the row positions into 0 (control) and 1 (treatment).
  3. np.random.permutation then shuffles the assignments, so which rows get which label is random while the counts stay fixed.
  4. The result is stored as a new column in the original dataframe.
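
The same idea can be sketched without pd.qcut, applied to the question's Control/Test labels and group sizes. This is only an illustration under stated assumptions, not the answer's original code: the helper name exact_split and its labels and seed parameters are hypothetical, and it swaps pd.qcut for np.repeat, building the exact number of each label and then shuffling them.

```python
import numpy as np
import pandas as pd


def exact_split(df, proportions, group_col="Group",
                labels=("Control", "Test"), seed=None):
    """Hypothetical helper: label rows so that each group's counts match
    its proportions as closely as integer rounding allows."""
    rng = np.random.default_rng(seed)
    out = pd.Series(index=df.index, dtype=object)
    for key, idx in df.groupby(group_col).groups.items():
        n = len(idx)
        # Rounded cumulative boundaries; the last one is forced to n
        # so that every row in the group receives a label.
        bounds = np.round(np.cumsum(proportions[key]) * n).astype(int)
        bounds[-1] = n
        counts = np.diff(np.concatenate(([0], bounds)))
        # Build exactly counts[i] copies of labels[i], then shuffle them.
        flags = np.repeat(labels, counts)
        out.loc[idx] = rng.permutation(flags)
    return out


# Group sizes taken from the question: 270, 180 and 330 customers.
df = pd.DataFrame({"customer_id": range(780),
                   "Group": [1] * 270 + [2] * 180 + [3] * 330})
d = {1: [.5, .5], 2: [.4, .6], 3: [.2, .8]}
df["Flag"] = exact_split(df, d, seed=0)
counts = pd.crosstab(df["Group"], df["Flag"])
print(counts)
```

With these group sizes the counts are 135/135, 72/108 and 66/264 whatever the seed, because only the order of the labels is random. When a group size does not divide evenly, the rounding decides which label gets the extra row.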



Answer 2:


In [13]: df
Out[13]:
  customer_id  Group
0         ABC      1
1         CDE      3
2         BHF      2
3         NID      1
4         WKL      3
5         SDI      2
6         JSK      1
7         OSM      3
8         MPA      2
9         MAD      1

In [14]: d = {1:[.5,.5], 2:[.4,.6], 3:[.2,.8]}

In [15]: df['Flag'] = \
    ...: df.groupby('Group')['customer_id'] \
    ...:   .transform(lambda x: np.random.choice(['Control','Test'], len(x), p=d[x.name]))
    ...:

In [16]: df
Out[16]:
  customer_id  Group     Flag
0         ABC      1  Control
1         CDE      3     Test
2         BHF      2     Test
3         NID      1  Control
4         WKL      3  Control
5         SDI      2     Test
6         JSK      1     Test
7         OSM      3     Test
8         MPA      2  Control
9         MAD      1     Test


Source: https://stackoverflow.com/questions/46552395/assign-control-vs-treatment-groupings-randomly-based-on-for-more-than-2-group
