Form groups of individuals python (pandas)

Question

I have a data set of the following form:

import pandas as pd
d1 = {'Subject': ['Subject1','Subject1','Subject1','Subject2','Subject2','Subject2','Subject3','Subject3','Subject3','Subject4','Subject4','Subject4'],
'Event':['1','2','3','1','2','3','1','2','3','1','2','3'],
'Category':['1','1','2','2','1','2','2','','2','1','1',''],
'Variable1':['1','2','3','4','5','6','7','8','9','10','11','12'],
'Variable2':['12','11','10','9','8','7','6','5','4','3','2','1'],
'Variable3': ['-6','-5','-4','-3','-4','-3','-2','-1','0','1','2','3']}
d1 = pd.DataFrame(d1)
d1=d1[['Subject','Event','Category','Variable1','Variable2','Variable3']]
d1

This looks as follows:

Where
1) 'Subject' is the subject level identifier.
2) 'Event' is the event level identifier.
3) 'Category' is the category level identifier.
4) Variable1, Variable2 & Variable3 are some continuous variables for each subject.

I need to form all feasible pairs (groups of 2) of 'Subject' within each 'Event', separately for each 'Category'.

For instance, for Event 1, the only possible pairs are: 1) Subject1 - Subject4 (For Category 1) 2) Subject2 - Subject3 (For Category 2)

Note: if a category value is missing, the 'Subject' is considered not to have taken part in that event.

After forming each possible pair, I have to take Variable1, Variable2 and Variable3 for both subjects and put them side by side.

This should look like the following:

What is important is to maintain the order in which each 'Subject' appears under the Match1 and Match2 columns, and the ordering of the Variable1, Variable2, Variable3 columns.

The possible pairings for Event 2 are shown below:

Note that since Category is blank for Subject3, she does not appear in the pairings.

Similarly, the possible pairings for Event 3 are shown below. Note that since Category is blank for Subject4, she does not appear in the pairings.

The final table looks like this:

Note that all numbers are random. In the actual dataset, I have about 15 categories each with about 1000 subjects spanning across 300 events. In some cases, some categories may have no observations for an event just as shown here.
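For a sense of scale, the number of pairs can be bounded with a little arithmetic. The sketch below assumes, pessimistically, that every one of the roughly 1000 subjects in each of the 15 categories takes part in all 300 events. The figures are the approximate ones quoted above, not measured values, and the helper name n_pairs is made up for illustration.

```python
def n_pairs(n):
    """Number of unordered pairs among n subjects (n choose 2)."""
    return n * (n - 1) // 2


# Approximate figures taken from the description above (assumptions).
subjects_per_category = 1000
categories = 15
events = 300

pairs_per_group = n_pairs(subjects_per_category)
total_pairs = pairs_per_group * categories * events

print(pairs_per_group)  # 499500
print(total_pairs)      # 2247750000
```

At roughly 2.2 billion rows in this worst case, the full table may not fit in memory at once, so handling one event or one (Event, Category) group at a time could be necessary.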

Please let me know if my question is not very clear or if I made a mistake in the pair examples here.

Any help will be appreciated. Thanks in advance.

Answer 1:

Use:

from itertools import combinations

d1['Category'] = d1['Category'].mask(d1['Category'] == '')

L = [(event, category, m1, m2)
     for (event, category), x in d1.groupby(['Event','Category'])['Subject']
     for m1, m2 in combinations(x, 2)]
df = pd.DataFrame(L, columns=['Event','Category','Match1','Match2'])

df1 = (df.rename(columns={'Match1':'Subject'})
         .merge(d1, on=['Event','Category','Subject'], how='left')
         .iloc[:, 4:]
         .add_suffix('.1'))
df2 = (df.rename(columns={'Match2':'Subject'})
         .merge(d1, on=['Event','Category','Subject'], how='left')
         .iloc[:, 4:]
         .add_suffix('.2'))

fin = pd.concat([df, df1, df2], axis=1)

print (fin)
  Event Category    Match1    Match2 Variable1.1 Variable2.1 Variable3.1  \
0     1        1  Subject1  Subject4           1          12          -6   
1     1        2  Subject2  Subject3           4           9          -3   
2     2        1  Subject1  Subject2           2          11          -5   
3     2        1  Subject1  Subject4           2          11          -5   
4     2        1  Subject2  Subject4           5           8          -4   
5     3        2  Subject1  Subject2           3          10          -4   
6     3        2  Subject1  Subject3           3          10          -4   
7     3        2  Subject2  Subject3           6           7          -3   

  Variable1.2 Variable2.2 Variable3.2  
0          10           3           1  
1           7           6          -2  
2           5           8          -4  
3          11           2           2  
4          11           2           2  
5           6           7          -3  
6           9           4           0  
7           9           4           0

Explanation:

Replace empty strings with NaN using mask; groupby silently drops rows whose group key is NaN.
Build a DataFrame from a list comprehension that flattens all length-2 combinations of the Subject column within each group of Event and Category.
Join the variable columns twice with merge and a left join (once for Match1, once for Match2), drop the first 4 columns by position with iloc, and use add_suffix (or add_prefix) to avoid duplicate column names.
Finally, concat all 3 DataFrames together.
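For data of the size described in the question, building every pair in a single list may need more memory than is available. The sketch below is one possible variation, not part of the original answer: it applies the same combinations idea to one (Event, Category) group at a time and yields a small DataFrame per group. The function name iter_pair_frames and the VALUE_COLS constant are hypothetical names chosen for illustration. It relies on itertools.combinations emitting pairs in input order, which keeps the Match1/Match2 ordering the question asks for.

```python
from itertools import combinations

import pandas as pd

VALUE_COLS = ["Variable1", "Variable2", "Variable3"]  # hypothetical constant


def iter_pair_frames(frame, value_cols=VALUE_COLS):
    """Yield one DataFrame of subject pairs per (Event, Category) group."""
    # Rows with an empty Category did not take part in the event.
    valid = frame[frame["Category"] != ""]
    for (event, category), group in valid.groupby(["Event", "Category"]):
        # Row positions of every pair, in the order the rows appear.
        pos = list(combinations(range(len(group)), 2))
        if not pos:
            continue  # a single subject cannot form a pair
        first, second = (list(p) for p in zip(*pos))
        left = group.iloc[first].reset_index(drop=True)
        right = group.iloc[second].reset_index(drop=True)
        out = pd.DataFrame(
            {"Event": event, "Category": category,
             "Match1": left["Subject"], "Match2": right["Subject"]},
            columns=["Event", "Category", "Match1", "Match2"],
        )
        # Put each side's variables next to each other, suffixed .1 / .2.
        for suffix, side in ((".1", left), (".2", right)):
            for col in value_cols:
                out[col + suffix] = side[col]
        yield out


# Tiny made-up sample: Subject3 has a blank Category and is skipped.
sample = pd.DataFrame({
    "Subject": ["Subject1", "Subject2", "Subject3"],
    "Event": ["1", "1", "1"],
    "Category": ["1", "1", ""],
    "Variable1": ["1", "4", "7"],
    "Variable2": ["12", "9", "6"],
    "Variable3": ["-6", "-3", "-2"],
})
pairs = pd.concat(list(iter_pair_frames(sample)), ignore_index=True)
print(pairs)
```

Each yielded frame can be written to disk or aggregated before the next one is built, so only one group's pairs are held in memory at a time. Note that pd.concat raises an error on an empty list, so the final concatenation assumes at least one group produces a pair.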

Source: https://stackoverflow.com/questions/49968861/form-groups-of-individuals-python-pandas

Tags

python

python-2.7

pandas

data-manipulation