Pandas: how to create a running count column?

Submitted by 懵懂的女人 on 2020-01-24 20:50:09

Question


I have a flat text file of the following form (column headers added by me):

CASE        Diagnosis
  S1 no diagnosis
  S2 fungus
     squamous lesion
  S3 fungus
  S4 squamous lesion
     glandular lesion
     atypia

I would like to stack and unstack cases with multiple diagnoses, so I would like:

CASE DxN         Diagnosis
  S1 A   no diagnosis
  S2 A   fungus   
     B   squamous lesion
  S3 A   fungus
  S4 A   squamous lesion
     B   glandular lesion
     C   atypia

and

CASE                 A                 B       C
  S1 no diagnosis
  S2 fungus             squamous lesion
  S3 fungus
  S4 squamous lesion    glandular lesion  atypia

How do I make that subseries DxN? The count should never be greater than F: even if there were 10,000 possible diagnoses, there are never more than 6 per case, so there should be no more than 6 columns. I just want to ask "What is diagnosis A for case S1, what is diagnosis B for case S1, what is diagnosis C for case S1?" I don't want a column for every possible diagnosis.


Answer 1:


Is this what you need?

    df=df.replace('',np.nan).ffill()
    df.assign(DxN=df.groupby('CASE').cumcount()).set_index(['CASE','DxN']).Diagnosis.unstack(fill_value='')
    Out[709]: 
    DxN                0                1
    CASE                                 
    S1       nodiagnosis                 
    S2            fungus   squamouslesion
    S3            fungus                 
    S4    squamouslesion  glandularlesion
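The output above labels the columns 0, 1, ... rather than the letters A, B, C that the question asks for. A minimal sketch of one way to get lettered labels follows, assuming the blank CASE cells are empty strings and that no case has more than 26 diagnoses. The sample frame and the names `long_form` and `wide_form` are illustrative, not part of the original answer.

```python
import string

import numpy as np
import pandas as pd

# Sample rows mirroring the question; a blank CASE cell continues the previous case.
df = pd.DataFrame(
    {
        "CASE": ["S1", "S2", "", "S3", "S4", "", ""],
        "Diagnosis": [
            "no diagnosis",
            "fungus",
            "squamous lesion",
            "fungus",
            "squamous lesion",
            "glandular lesion",
            "atypia",
        ],
    }
)

# Forward-fill only the CASE column so blank cells inherit the case above them.
df["CASE"] = df["CASE"].replace("", np.nan).ffill()

# cumcount() numbers the rows 0, 1, 2, ... within each case; turn 0 -> A, 1 -> B, ...
df["DxN"] = df.groupby("CASE").cumcount().map(lambda n: string.ascii_uppercase[n])

# Stacked form: one row per (CASE, DxN) pair.
long_form = df.set_index(["CASE", "DxN"])["Diagnosis"]

# Unstacked form: one row per case, one column per letter.
wide_form = long_form.unstack(fill_value="")
print(wide_form)
```

Because the letters come from the per-case running count, the number of columns equals the largest number of diagnoses in any single case, not the number of distinct diagnoses.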



Answer 2:


Here is one method, starting with the data in the text format you have:

import pandas as pd
import numpy as np

df = pd.DataFrame([['S1', 'no diagnosis'],
                   ['S2', 'fungus'],
                   ['', 'squamous lesion'],
                   ['S3', 'fungus'],
                   ['S4', 'squamous lesion'],
                   ['', 'glandular lesion']],
                  columns=['CASE', 'Diagnosis'])

# forward-fill the CASE column so blank cells inherit the case above them
df.CASE = df.CASE.replace('', np.nan).ffill()

# pivot data
df = pd.pivot_table(df, index=['CASE'], values=['Diagnosis'],
                    aggfunc=lambda x: list(x)).reset_index()

# split columns of lists into separate columns
df = pd.concat([df[['CASE']], pd.DataFrame(df['Diagnosis'].values.tolist())], axis=1)

#   CASE                0                 1
# 0   S1     no diagnosis              None
# 1   S2           fungus   squamous lesion
# 2   S3           fungus              None
# 3   S4  squamous lesion  glandular lesion
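The frame above is typed in by hand. If the data really sits in a fixed-width text file, something like the following sketch could load it directly. The column positions (CASE in characters 0-3, Diagnosis from character 5 onward) are an assumption read off the sample in the question, and the `raw` string stands in for the actual file.

```python
import io

import pandas as pd

# Stand-in for the flat file from the question (the file itself has no header row).
raw = (
    "  S1 no diagnosis\n"
    "  S2 fungus\n"
    "     squamous lesion\n"
    "  S3 fungus\n"
    "  S4 squamous lesion\n"
    "     glandular lesion\n"
    "     atypia\n"
)

# Assumed layout: CASE occupies characters [0, 4), Diagnosis runs from 5 to end of line.
df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=[(0, 4), (5, None)],
    header=None,
    names=["CASE", "Diagnosis"],
)

# Blank CASE fields are read as NaN, so a forward fill assigns them to the case above.
df["CASE"] = df["CASE"].ffill()
print(df)
```

With a real file, the `io.StringIO(raw)` argument would be replaced by the file path; the column positions would need adjusting if the real layout differs.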



Answer 3:


You can create a column holding the running count of diagnoses within each case. See this post for more details: SQL-like window functions in PANDAS: Row Numbering in Python Pandas Dataframe

With this sample data:

import pandas as pd

df = pd.DataFrame([
    {'Case': 'S1', 'Diagnosis': 'no diagnosis'},
    {'Case': 'S2', 'Diagnosis': 'fungus'},
    {'Case': 'S2', 'Diagnosis': 'squamous lesion'}
])

This line will give you the running count, starting at 1 for each case:

# groupby keeps the original row order within each case, so no sort is needed
df['DxN'] = df.groupby('Case').cumcount() + 1
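Once each row has a running count, that count can serve as the column key for the wide layout the question asks for. The sketch below uses `DataFrame.pivot` for that step and `melt` to go back to the long layout. It repeats the sample data so it runs on its own, and the names `wide` and `long_again` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Case": ["S1", "S2", "S2"],
        "Diagnosis": ["no diagnosis", "fungus", "squamous lesion"],
    }
)

# Running count of diagnoses within each case, starting at 1.
df["DxN"] = df.groupby("Case").cumcount() + 1

# Long -> wide: one row per case, one column per running-count value.
wide = df.pivot(index="Case", columns="DxN", values="Diagnosis")

# Wide -> long: melt the count columns back into rows and drop the empty slots.
long_again = (
    wide.reset_index()
    .melt(id_vars="Case", var_name="DxN", value_name="Diagnosis")
    .dropna(subset=["Diagnosis"])
    .sort_values(["Case", "DxN"])
    .reset_index(drop=True)
)
print(wide)
print(long_again)
```

Cases with fewer diagnoses than the maximum get missing values in the unused columns of `wide`, which is why the round trip drops them again.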


Source: https://stackoverflow.com/questions/48588960/pandas-how-to-create-a-running-count-column
