Pandas - How to pick events by date and create a new ordered dataframe. - surgery patients

后端未结

关注

 1  721

I am a surgeon looking at neurosurgery. I have a dataframe of 600,000 records, and 70 columns with about 7 date columns for various events that happened to patients in a hospita

相关标签:

1条回答

刺人心

2021-01-27 06:36

Edit: I've changed the filter criterion below to at least two different OPs of interest.

Here is one way to do this. I've changed your data somewhat for testing purposes.

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 999, 3, 1, 999, 2],
                   'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'], 
                   'OP_date': ['2014-12-12', '2014-06-23', '2014-08-07', '2014-09-12', 
                               '2018-10-03', '2018-07-06', '2018-05-18']})
df.set_index('ID', inplace=True)
display(df)

   OP_code     OP_date
ID      
1    V011   2014-12-12
2    A082   2014-06-23
999  V011   2014-08-07
3    V011   2014-09-12
1    A651   2018-10-03
999  V014   2018-07-06
2    A263   2018-05-18

First we should transform the data so that there is exactly one row per patient, collecting the data from multiple OPs in lists:

df_patients = pd.pivot_table(df, index=df.index, aggfunc=list)
display(df_patients)

     OP_code        OP_date
ID      
1    [V011, A651]   [2014-12-12, 2018-10-03]
2    [A082, A263]   [2014-06-23, 2018-05-18]
3    [V011]         [2014-09-12]
999  [V011, V014]   [2014-08-07, 2018-07-06]

Now given a list of the OP codes that correspond to the implants you're interested in, we can loop through the rows of this DataFrame to create an index of only those patients that had at least two different OPs of interest. Then we can filter the data according to this new index.

implant_codes = {'V011', 'V014'}

implant_index = []
for i in df_patients.index:
    """EDIT: filter criterion tightened to at least two different 
       relevant OPs, i.e. the intersection of the implant_codes 
       list with the patient's OP list has at least two elements."""
    if len(implant_codes.intersection(df_patients.OP_code[i])) >= 2: 
        implant_index.append(i)

df_implants = df_patients.filter(implant_index, axis=0)
display(df_implants)

     OP_code       OP_date
ID      
999  [V011, V014]  [2014-08-07, 2018-07-06]

You can access data elements here by a combination of the indexing syntax for DataFrames and lists, e.g. df_implants.loc[999, 'OP_date'][0] yields the first OP date of patient 999: '2014-08-07'

I would not recommend creating a separate column for each OP. You could try something like this:

df_implants[['OP_date_1', 'OP_date_2']] = pd.DataFrame(df_implants.OP_date.values.tolist(), 
                                                       index=df_implants.index)
display(df_implants)

     OP_code       OP_date                   OP_date_1   OP_date_2
ID              
999  [V011, V014]  [2014-08-07, 2018-07-06]  2014-08-07  2018-07-06

However, this approach will run into trouble in practice, due to the fact that the number of OPs varies across patients. That's why I think the list representation given above is more natural and easier to handle.

0 讨论(0)