Question
I've got a table showing the accumulated number of hours (the dataframe values) that different specialists (ID) have taken to complete a sequence of four tasks ['Task1', 'Task2', 'Task3', 'Task4'], like this:
Input:
   ID  Task1  Task2  Task3  Task4
0  10      1      3      4      6
1  11      1      3      4      5
2  12      1      3      4      6
Now I'd like to reshape that dataframe so that I can easily find out which task each specialist was working on after 1 hour, 2 hours, and so on. So the desired output looks like this:
Desired output:
value      1      3      4      5      6
ID
10     Task1  Task2  Task3  Task3  Task4
11     Task1  Task2  Task3  Task4  Task4
12     Task1  Task2  Task3  Task3  Task4
With this particular dataframe, I've managed to produce the desired output using pd.melt(), df.pivot() and df.fillna() like this (complete snippet with sample data further down):
What I have tried:
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
The problem is that this approach is not very robust: it breaks easily on a dataset that produces (I think) duplicate column names. Here's an example where that happens just by changing Task3 for the first row (index 0, ID=10) from 5 to 4:
Code 1
import pandas as pd
df1 = pd.DataFrame({ 'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 4, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df
Code 1 - Error:
ValueError: Index contains duplicate entries, cannot reshape
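The "duplicate entries" in this error are repeated (ID, value) pairs in the melted frame rather than duplicate column names: pivot would have to put two task names into a single cell. A minimal sketch for locating them, using the same sample data (the names long_df and dupes are only illustrative):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [10, 11, 12],
                    'Task1': [1, 1, 1],
                    'Task2': [4, 3, 3],
                    'Task3': [4, 4, 4],
                    'Task4': [6, 5, 6]})

# Long format: one row per (ID, task), accumulated hours in 'value'
long_df = df1.melt(id_vars='ID')

# keep=False flags every row whose (ID, value) pair occurs more than once
dupes = long_df[long_df.duplicated(subset=['ID', 'value'], keep=False)]
print(dupes)
```

Here ID 10 reaches 4 hours for both Task2 and Task3, so that pair shows up twice.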
And according to the docs, pd.pivot_table() is a:
generalization of pivot that can handle duplicate values for one index/column pair.
So I was hoping that pd.pivot_table() would be better suited for this case. Alas, this triggers:
DataError: No numeric types to aggregate
Does anyone know whether there is a robust way of handling these errors? Am I perhaps just using pd.pivot_table() the wrong way? I've also tried passing aggfunc=None.
I'm at a loss here, so any suggestions would be great! Ideally I'm hoping for an approach that uses df.pivot or pd.pivot_table, and/or the shortest approach possible.
Complete working code example:
import pandas as pd
df1 = pd.DataFrame({ 'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 5, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df
Complete example where both df.pivot and pd.pivot_table fail:
import pandas as pd
df1 = pd.DataFrame({ 'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 4, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
# df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.pivot_table(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df
Answer 1:
You can also do this using pd.crosstab (here df is the original wide dataframe, called df1 in the question):
dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='first').ffill(axis=1)
print(df_out)
Output:
val      1      3      4      5      6
ID
10   Task1  Task1  Task2  Task2  Task4
11   Task1  Task2  Task3  Task4  Task4
12   Task1  Task2  Task3  Task3  Task4
Or changing the aggfunc to 'last':
dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='last').ffill(axis=1)
df_out
Output:
val      1      3      4      5      6
ID
10   Task1  Task1  Task3  Task3  Task4
11   Task1  Task2  Task3  Task4  Task4
12   Task1  Task2  Task3  Task3  Task4
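Since the question asked specifically about pd.pivot_table, here is a sketch of the same idea with that function. The DataError in the question comes from the default aggfunc ('mean'), which cannot aggregate strings; passing an aggregation that works on strings, such as 'first' or 'last', avoids it. The names long_df and out are only illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [10, 11, 12],
                    'Task1': [1, 1, 1],
                    'Task2': [4, 3, 3],
                    'Task3': [4, 4, 4],
                    'Task4': [6, 5, 6]})

long_df = df1.melt(id_vars='ID', value_name='val')

# aggfunc='last' keeps the later task when two tasks share the same hour
out = long_df.pivot_table(index='ID', columns='val',
                          values='variable', aggfunc='last')

# Fill the hours in between with the most recent task
out = out.ffill(axis=1)
print(out)
```

This should give the same table as the 'last' variant of the crosstab call; switching to aggfunc='first' keeps the earlier task instead.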
Answer 2:
I'm pretty sure this is not the best way to do it, but it is one way.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},
'Task1': {0: 1, 1: 1, 2: 1},
'Task2': {0: 4, 1: 3, 2: 3},
'Task3': {0: 4, 1: 4, 2: 4},
'Task4': {0: 6, 1: 5, 2: 6}})
# Long format: one row per (ID, task), accumulated hours in 'value'
df1 = pd.melt(df, id_vars=['ID'], value_vars=df.columns[1:])
df1['value'] = df1['value'].astype(int)
df1.set_index(['ID','value'], inplace=True)
# Build every (ID, hour) combination from 1 up to the largest hour
df_max_val = int(df.set_index('ID').max().max())
ids = df['ID'].tolist()*df_max_val
values = list(np.array([[i]*len(set(ids)) for i in range(1, df_max_val+1)]).flatten())
df2 = pd.DataFrame({'ID':ids,
'value':values})
df2.set_index(['ID','value'], inplace=True)
# Attach the task names; keep the last task when an hour is duplicated
df3 = df2.merge(df1, left_index=True, right_index=True, how='outer')
df3 = df3.reset_index().drop_duplicates(subset=['ID','value'], keep='last')
# Forward-fill within each ID, then pivot to one column per hour
df3 = pd.concat([df3[df3['ID']==i].ffill() for i in df3['ID'].unique()])
df3 = df3.pivot(index='ID', columns='value', values='variable')
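For comparison, a shorter sketch that aims at the same result (one column for every hour from 1 to the maximum) using groupby, unstack and reindex instead of building the grid by hand. This is an alternative formulation, not the code above rewritten line by line, and the names long_df, wide, hours and out are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'ID': [10, 11, 12],
                   'Task1': [1, 1, 1],
                   'Task2': [4, 3, 3],
                   'Task3': [4, 4, 4],
                   'Task4': [6, 5, 6]})

long_df = df.melt(id_vars='ID')

# One task per (ID, hour); .last() keeps the later task on duplicates
wide = long_df.groupby(['ID', 'value'])['variable'].last().unstack()

# Add a column for every hour from 1 to the largest value, then forward-fill
hours = range(1, int(long_df['value'].max()) + 1)
out = wide.reindex(columns=hours).ffill(axis=1)
print(out)
```

Hours that never appear in the data (here hour 2) become new columns and inherit the task from the previous hour.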
Source: https://stackoverflow.com/questions/65974776/how-to-handle-valueerror-index-contains-duplicate-entries-using-df-pivot-or-pd