Question
I have a data set of the following form:
import pandas as pd
d1 = {'Subject': ['Subject1','Subject1','Subject1','Subject2','Subject2','Subject2','Subject3','Subject3','Subject3','Subject4','Subject4','Subject4'],
'Event':['1','2','3','1','2','3','1','2','3','1','2','3'],
'Category':['1','1','2','2','1','2','2','','2','1','1',''],
'Variable1':['1','2','3','4','5','6','7','8','9','10','11','12'],
'Variable2':['12','11','10','9','8','7','6','5','4','3','2','1'],
'Variable3': ['-6','-5','-4','-3','-4','-3','-2','-1','0','1','2','3']}
d1 = pd.DataFrame(d1)
d1=d1[['Subject','Event','Category','Variable1','Variable2','Variable3']]
d1
This looks as follows:
Where
1) 'Subject' is the subject level identifier.
2) 'Event' is the event level identifier.
3) 'Category' is the category level identifier.
4) Variable1, Variable2 & Variable3 are some continuous variables for each subject.
I need to form every feasible pair of 'Subject' values within each 'Event' and each 'Category'.
For instance, for Event 1, the only possible pairs are: 1) Subject1 - Subject4 (for Category 1) and 2) Subject2 - Subject3 (for Category 2).
Note that a missing category value means the 'Subject' is considered not to have taken part in the event.
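To make the pairing rule concrete, here is a minimal pure-Python sketch. It uses only the Event 1 and Event 2 rows from the data above; the names `rows`, `groups` and `pairs` are illustrative and not part of the original code.

```python
from collections import defaultdict
from itertools import combinations

# (Subject, Event, Category) rows for Events 1 and 2, copied from the data above
rows = [
    ('Subject1', '1', '1'), ('Subject2', '1', '2'),
    ('Subject3', '1', '2'), ('Subject4', '1', '1'),
    ('Subject1', '2', '1'), ('Subject2', '2', '1'),
    ('Subject3', '2', ''),  ('Subject4', '2', '1'),
]

# Collect subjects per (Event, Category), skipping blank categories
groups = defaultdict(list)
for subject, event, category in rows:
    if category == '':
        continue  # a blank category means the subject did not take part
    groups[(event, category)].append(subject)

# combinations() keeps the input order, so the subject that appears
# first in the data is always the first member of the pair
pairs = {key: list(combinations(members, 2)) for key, members in groups.items()}
for key, value in pairs.items():
    print(key, value)
```

For Event 1 this gives exactly the two pairs listed above, and Subject3 is absent from the Event 2 pairs because her category is blank.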
After forming each possible pair, I have to take Variable1, Variable2 and Variable3 for both subjects and put them side by side.
This should look like the following:
What is important is to maintain the order in which 'Subject' appears under the Match1 and Match2 columns, and the ordering of the Variable1, Variable2, Variable3 columns.
The possible pairings for Event 2 are shown below:
Note that since Category is blank for Subject3, she does not appear in the pairings.
Similarly, the possible pairings for Event 3 are shown below: Note that since Category is blank for Subject4, she does not appear in the pairings.
The final table looks like this:
Note that all numbers are random. In the actual dataset, I have about 15 categories each with about 1000 subjects spanning across 300 events. In some cases, some categories may have no observations for an event just as shown here.
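As a rough size check, the number of pairs can be estimated up front. This sketch assumes the upper-bound reading of the figures above, namely that each of the 15 categories has 1000 participating subjects in every one of the 300 events; the variable names are illustrative.

```python
from math import comb

subjects_per_category = 1000  # assumption: every category is fully populated
categories = 15
events = 300

# Unordered pairs within one (Event, Category) group
pairs_per_group = comb(subjects_per_category, 2)
total_pairs = pairs_per_group * categories * events

print(pairs_per_group)  # 499500
print(total_pairs)      # 2247750000
```

Under that assumption the final table would have over two billion rows, so it may be worth processing one event or one category at a time rather than building everything in memory.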
Please let me know if my question is not very clear or if I made a mistake in the pair examples here.
Any help will be appreciated. Thanks in advance.
Answer 1:
Use:
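from itertools import combinations

# Blank categories become NaN, so groupby drops those rows
d1['Category'] = d1['Category'].mask(d1['Category'] == '')

# One row per pair of subjects within each (Event, Category) group
L = [(i[0], i[1], y[0], y[1])
     for i, x in d1.groupby(['Event', 'Category'])['Subject']
     for y in combinations(x, 2)]
df = pd.DataFrame(L, columns=['Event', 'Category', 'Match1', 'Match2'])

# Look up the variables of the first subject in each pair
df1 = (df.rename(columns={'Match1': 'Subject'})
         .merge(d1, on=['Event', 'Category', 'Subject'], how='left')
         .iloc[:, 4:]
         .add_suffix('.1'))

# Look up the variables of the second subject in each pair
df2 = (df.rename(columns={'Match2': 'Subject'})
         .merge(d1, on=['Event', 'Category', 'Subject'], how='left')
         .iloc[:, 4:]
         .add_suffix('.2'))

# Put the pair labels and both sets of variables side by side
fin = pd.concat([df, df1, df2], axis=1)
print(fin)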
   Event  Category    Match1    Match2  Variable1.1  Variable2.1  Variable3.1  \
0      1         1  Subject1  Subject4            1           12           -6
1      1         2  Subject2  Subject3            4            9           -3
2      2         1  Subject1  Subject2            2           11           -5
3      2         1  Subject1  Subject4            2           11           -5
4      2         1  Subject2  Subject4            5            8           -4
5      3         2  Subject1  Subject2            3           10           -4
6      3         2  Subject1  Subject3            3           10           -4
7      3         2  Subject2  Subject3            6            7           -3

   Variable1.2  Variable2.2  Variable3.2
0           10            3            1
1            7            6           -2
2            5            8           -4
3           11            2            2
4           11            2            2
5            6            7           -3
6            9            4            0
7            9            4            0
Explanation:
- Replace empty strings with NaN using mask; groupby then silently drops these rows.
- Build a DataFrame from a list comprehension that flattens all length-2 combinations of the Subject column within each group defined by the Event and Category columns.
- Join the variable columns twice using merge with a left join, drop the first 4 columns by position with iloc, and apply add_suffix (or add_prefix) to avoid duplicate column names.
- Finally, concat all 3 DataFrames together.
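The first step relies on groupby skipping rows whose key is NaN, which is its default behavior. A minimal sketch on a toy frame holding the Event 2 rows (the names `toy` and `groups` are illustrative) shows the effect:

```python
import pandas as pd

# Event 2 rows from the question; Subject3 has a blank category
toy = pd.DataFrame({
    'Event': ['2', '2', '2', '2'],
    'Category': ['1', '1', '', '1'],
    'Subject': ['Subject1', 'Subject2', 'Subject3', 'Subject4'],
})

# mask() replaces values where the condition is True with NaN
toy['Category'] = toy['Category'].mask(toy['Category'] == '')

# Rows with a NaN group key are left out of every group by default
groups = {key: list(subjects)
          for key, subjects in toy.groupby(['Event', 'Category'])['Subject']}
print(groups)
```

Only the group for Event '2', Category '1' remains, containing Subject1, Subject2 and Subject4, so Subject3 can never end up in a pair.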
Source: https://stackoverflow.com/questions/49968861/form-groups-of-individuals-python-pandas