Question
Using pandas, I want to convert a long data frame to wide but the usual pivot
method is not as flexible as I need.
Here is the long data:
import pandas as pd

raw = {
    'sample': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'gene': ['G1', 'G2', 'G3', 'G3', 'G1', 'G2', 'G2', 'G2', 'G3', 'G3'],
    'type': ['HIGH', 'HIGH', 'LOW', 'MED', 'HIGH', 'LOW', 'LOW', 'LOW', 'MED', 'LOW']}
df = pd.DataFrame(raw)
which produces
gene sample type
G1 1 HIGH
G2 1 HIGH
G3 1 LOW
G3 1 MED
G1 2 HIGH
G2 2 LOW
G2 3 LOW
G2 3 LOW
G3 3 MED
G3 3 LOW
What I want is a data frame that has gene as rows and sample as columns, with each cell filled with the "greatest" type according to HIGH > MED > LOW > NONE, i.e. it should look like
casted = {
'gene':['G1', 'G2', 'G3'],
'1':['HIGH', 'HIGH', 'MED'],
'2':['HIGH', 'LOW', 'NONE'],
'3':['NONE', 'LOW', 'MED']
}
dfCast = pd.DataFrame(casted)
which makes
1 2 3 gene
HIGH HIGH NONE G1
HIGH LOW LOW G2
MED NONE MED G3
Trivially and erroneously, my long to wide command would look like
df = df.pivot(index='gene', columns = 'sample', values='type')
but of course this doesn't account for the hierarchy I want to impose, where HIGH > MED > LOW > NONE.
When casting, how can I control what the cell value is?
Answer 1:
You can use pivot_table, which provides an aggfunc parameter for aggregating duplicated index-column values. To rank the keywords HIGH, MED and LOW in the order you need, make them keys of a dictionary whose values are in monotonic order, and pick the extreme value with min/max as the aggregation function:
cat = {"HIGH": 3, "MED": 2, "LOW": 1}
df.pivot_table("type", "gene", "sample", aggfunc=lambda x: max(x, key=cat.get))
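A minimal end-to-end sketch of the dictionary approach, assuming the sample data from the question. The names rank and wide are illustrative only. The final fillna("NONE") step is an assumption about how missing gene/sample combinations should be labelled, since pivot_table leaves them as NaN:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'sample': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'gene': ['G1', 'G2', 'G3', 'G3', 'G1', 'G2', 'G2', 'G2', 'G3', 'G3'],
    'type': ['HIGH', 'HIGH', 'LOW', 'MED', 'HIGH', 'LOW', 'LOW', 'LOW', 'MED', 'LOW'],
})

# Higher number = "greater" type (illustrative name, not from the original).
rank = {"HIGH": 3, "MED": 2, "LOW": 1}

# For each (gene, sample) pair keep the label with the highest rank.
wide = df.pivot_table(
    index="gene",
    columns="sample",
    values="type",
    aggfunc=lambda s: max(s, key=rank.get),
)

# Assumption: combinations absent from the long data should read "NONE".
wide = wide.fillna("NONE")
print(wide)
```

The column labels stay as the integer sample ids, so a single cell can be read with something like wide.loc["G3", 1].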
Another option is to convert the type column to an ordered categorical dtype and then use pivot_table:
df['type'] = pd.Categorical(df['type'], ["LOW", "MED", "HIGH"], ordered=True)
df.pivot_table("type", "gene", "sample", aggfunc='max')
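The ordered-categorical idea can also be sketched with an explicit groupby followed by unstack. This is a swapped-in variant, not the pivot_table call above, chosen here only to make the aggregation step visible. The names best and wide_cat are illustrative, and casting to object before filling with "NONE" is an assumption about the desired output:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'sample': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'gene': ['G1', 'G2', 'G3', 'G3', 'G1', 'G2', 'G2', 'G2', 'G3', 'G3'],
    'type': ['HIGH', 'HIGH', 'LOW', 'MED', 'HIGH', 'LOW', 'LOW', 'LOW', 'MED', 'LOW'],
})

# Categories are listed from lowest to highest; ordered=True lets max() compare them.
df['type'] = pd.Categorical(df['type'], categories=["LOW", "MED", "HIGH"], ordered=True)

# Greatest category per (gene, sample) pair, then spread samples into columns.
best = df.groupby(['gene', 'sample'])['type'].max()
wide_cat = best.unstack('sample')

# "NONE" is not a declared category, so cast to plain objects before filling gaps.
wide_cat = wide_cat.astype(object).fillna("NONE")
print(wide_cat)
```

Declaring the order once on the column means any later max or sort on it follows the same hierarchy, without passing a key function each time.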
Source: https://stackoverflow.com/questions/42310781/pandas-long-to-wide