I am a surgeon looking at neurosurgery. I have a dataframe of 600,000 records, and 70 columns with about 7 date columns for various events that happened to patients in a hospita
Edit: I've changed the filter criterion below to at least two different OPs of interest.
Here is one way to do this. I've changed your data somewhat for testing purposes.
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 999, 3, 1, 999, 2],
                   'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'],
                   'OP_date': ['2014-12-12', '2014-06-23', '2014-08-07', '2014-09-12',
                               '2018-10-03', '2018-07-06', '2018-05-18']})
df.set_index('ID', inplace=True)
display(df)
OP_code OP_date
ID
1 V011 2014-12-12
2 A082 2014-06-23
999 V011 2014-08-07
3 V011 2014-09-12
1 A651 2018-10-03
999 V014 2018-07-06
2 A263 2018-05-18
First we should transform the data so that there is exactly one row per patient, collecting the data from multiple OPs in lists:
df_patients = pd.pivot_table(df, index=df.index, aggfunc=list)
display(df_patients)
OP_code OP_date
ID
1 [V011, A651] [2014-12-12, 2018-10-03]
2 [A082, A263] [2014-06-23, 2018-05-18]
3 [V011] [2014-09-12]
999 [V011, V014] [2014-08-07, 2018-07-06]
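As an aside, the same one-row-per-patient table can probably also be built with groupby instead of pivot_table. This is only a sketch of an alternative, not a correction: the name df_patients_alt is mine, and it assumes ID is the index, as in the test data above.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 999, 3, 1, 999, 2],
                   'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'],
                   'OP_date': ['2014-12-12', '2014-06-23', '2014-08-07', '2014-09-12',
                               '2018-10-03', '2018-07-06', '2018-05-18']})
df = df.set_index('ID')

# Group on the index level (the patient ID) and collect every column into a list.
# df_patients_alt is a hypothetical name used only for this illustration.
df_patients_alt = df.groupby(level=0).agg(list)
print(df_patients_alt)
```

Within each list, the entries keep the order in which the rows appear in df, so sorting by OP_date first would give chronologically ordered lists.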
Now, given the set of OP codes that correspond to the implants you're interested in, we can loop through the rows of this DataFrame to build an index of only those patients who had at least two different OPs of interest. Then we can filter the data according to this new index.
implant_codes = {'V011', 'V014'}
implant_index = []
for i in df_patients.index:
    # EDIT: filter criterion tightened to at least two different
    # relevant OPs, i.e. the intersection of the implant_codes
    # set with the patient's OP list has at least two elements.
    if len(implant_codes.intersection(df_patients.OP_code[i])) >= 2:
        implant_index.append(i)
df_implants = df_patients.filter(implant_index, axis=0)
display(df_implants)
OP_code OP_date
ID
999 [V011, V014] [2014-08-07, 2018-07-06]
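With 600,000 records, a Python-level loop may be slow. Here is a sketch of how the same patient IDs could be found directly on the original long-format data, without the loop. It assumes ID is an ordinary column rather than the index; the names relevant, n_distinct and implant_ids are mine.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 999, 3, 1, 999, 2],
                   'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'],
                   'OP_date': ['2014-12-12', '2014-06-23', '2014-08-07', '2014-09-12',
                               '2018-10-03', '2018-07-06', '2018-05-18']})
implant_codes = {'V011', 'V014'}

# Keep only the rows whose OP code is one of the implant codes.
relevant = df[df['OP_code'].isin(implant_codes)]

# Count the distinct implant codes per patient and keep patients with at least two.
n_distinct = relevant.groupby('ID')['OP_code'].nunique()
implant_ids = n_distinct[n_distinct >= 2].index.tolist()
print(implant_ids)
```

On the test data this selects only patient 999, matching the loop above.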
You can access individual data elements by combining the indexing syntax for DataFrames and lists, e.g. df_implants.loc[999, 'OP_date'][0] yields the first OP date of patient 999: '2014-08-07'
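To make that access pattern concrete, here is a small self-contained sketch. It rebuilds the one-row df_implants by hand so it runs on its own, and additionally converts the date strings with pd.to_datetime to get the number of days between the first and last OP. That conversion step is my own illustration and is not part of the approach above.

```python
import pandas as pd

# Hand-built copy of the filtered table above, so the snippet is self-contained.
df_implants = pd.DataFrame({'OP_code': [['V011', 'V014']],
                            'OP_date': [['2014-08-07', '2018-07-06']]},
                           index=pd.Index([999], name='ID'))

# DataFrame indexing selects the cell (a list); list indexing selects the element.
first_date = df_implants.loc[999, 'OP_date'][0]

# The dates are stored as strings, so convert them before doing date arithmetic.
dates = pd.to_datetime(df_implants.loc[999, 'OP_date'])
gap_days = (dates[-1] - dates[0]).days
print(first_date, gap_days)
```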
I would not recommend creating a separate column for each OP. If you need that layout anyway, you could try something like this:
df_implants[['OP_date_1', 'OP_date_2']] = pd.DataFrame(df_implants.OP_date.values.tolist(),
                                                       index=df_implants.index)
display(df_implants)
OP_code OP_date OP_date_1 OP_date_2
ID
999 [V011, V014] [2014-08-07, 2018-07-06] 2014-08-07 2018-07-06
However, this approach will run into trouble in practice because the number of OPs varies across patients. That's why I think the list representation given above is more natural and easier to handle.
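To illustrate the problem, here is a sketch with made-up ragged lists for three patients (the Series op_dates and the frame wide are hypothetical names). When the lists have different lengths, the number of columns is set by the patient with the most OPs, and everyone else gets missing values.

```python
import pandas as pd

# Patient 3 has only one OP, the others have two.
op_dates = pd.Series({1: ['2014-12-12', '2018-10-03'],
                      3: ['2014-09-12'],
                      999: ['2014-08-07', '2018-07-06']}, name='OP_date')

# Spreading the lists into columns pads the shorter lists with missing values.
wide = pd.DataFrame(op_dates.tolist(), index=op_dates.index)
wide.columns = [f'OP_date_{k + 1}' for k in wide.columns]
print(wide)
```

With real data, a single patient with many OPs would produce a wide table that is mostly empty, and a fixed column list such as ['OP_date_1', 'OP_date_2'] would no longer match.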