Pandas - How to pick events by date and create a new ordered dataframe. - surgery patients

后端 未结 1 721
孤城傲影
孤城傲影 2021-01-27 06:17

I am a surgeon looking at neurosurgery. I have a dataframe of 600,000 records, and 70 columns with about 7 date columns for various events that happened to patients in a hospita

相关标签:
1条回答
  • 2021-01-27 06:36

    Edit: I've changed the filter criterion below to at least two different OPs of interest.

    Here is one way to do this. I've changed your data somewhat for testing purposes.

    import pandas as pd
    
    df = pd.DataFrame({'ID': [1, 2, 999, 3, 1, 999, 2],
                       'OP_code': ['V011', 'A082', 'V011', 'V011', 'A651', 'V014', 'A263'], 
                       'OP_date': ['2014-12-12', '2014-06-23', '2014-08-07', '2014-09-12', 
                                   '2018-10-03', '2018-07-06', '2018-05-18']})
    df.set_index('ID', inplace=True)
    display(df)
    
       OP_code     OP_date
    ID      
    1    V011   2014-12-12
    2    A082   2014-06-23
    999  V011   2014-08-07
    3    V011   2014-09-12
    1    A651   2018-10-03
    999  V014   2018-07-06
    2    A263   2018-05-18
    

    First we should transform the data so that there is exactly one row per patient, collecting the data from multiple OPs in lists:

    df_patients = pd.pivot_table(df, index=df.index, aggfunc=list)
    display(df_patients)
    
         OP_code        OP_date
    ID      
    1    [V011, A651]   [2014-12-12, 2018-10-03]
    2    [A082, A263]   [2014-06-23, 2018-05-18]
    3    [V011]         [2014-09-12]
    999  [V011, V014]   [2014-08-07, 2018-07-06]
    

    Now given a list of the OP codes that correspond to the implants you're interested in, we can loop through the rows of this DataFrame to create an index of only those patients that had at least two different OPs of interest. Then we can filter the data according to this new index.

    implant_codes = {'V011', 'V014'}
    
    implant_index = []
    for i in df_patients.index:
        """EDIT: filter criterion tightened to at least two different 
           relevant OPs, i.e. the intersection of the implant_codes 
           list with the patient's OP list has at least two elements."""
        if len(implant_codes.intersection(df_patients.OP_code[i])) >= 2: 
            implant_index.append(i)
    
    df_implants = df_patients.filter(implant_index, axis=0)
    display(df_implants)
    
         OP_code       OP_date
    ID      
    999  [V011, V014]  [2014-08-07, 2018-07-06]
    

    You can access data elements here by a combination of the indexing syntax for DataFrames and lists, e.g. df_implants.loc[999, 'OP_date'][0] yields the first OP date of patient 999: '2014-08-07'

    I would not recommend creating a separate column for each OP. You could try something like this:

    df_implants[['OP_date_1', 'OP_date_2']] = pd.DataFrame(df_implants.OP_date.values.tolist(), 
                                                           index=df_implants.index)
    display(df_implants)
    
         OP_code       OP_date                   OP_date_1   OP_date_2
    ID              
    999  [V011, V014]  [2014-08-07, 2018-07-06]  2014-08-07  2018-07-06
    

    However, this approach will run into trouble in practice, due to the fact that the number of OPs varies across patients. That's why I think the list representation given above is more natural and easier to handle.

    0 讨论(0)
提交回复
热议问题