How to handle ValueError: Index contains duplicate entries using df.pivot or pd.pivot_table?

Question

I've got a table showing the accumulated number of hours (dataframe values) different specialists (ID) have taken to complete a sequence of four tasks ['Task1', 'Task2', 'Task3', 'Task4'] like this:

Input:

    ID  Task1   Task2   Task3   Task4
0   10      1       3       4       6
1   11      1       3       4       5
2   12      1       3       4       6

Now I'd like to reshape that dataframe so that I can easily find out which task each specialist was working on after 1 hour, 2 hours, and so on. So the desired output looks like this:

Desired output:

value   1       3       4       5       6
ID                  
10  Task1   Task2   Task3   Task3   Task4
11  Task1   Task2   Task3   Task4   Task4
12  Task1   Task2   Task3   Task3   Task4

With this particular dataframe, I've managed to produce the desired output using pd.melt(), df.pivot() and df.fillna() like this (complete snippet with sample data further down):

What I have tried:

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)

The problem is that this approach is not very robust: it easily breaks on a dataset that would produce (I think) duplicate column names. Here's an example where that happens by just changing Task3 in row 0 (ID=10) from 5 to 4:

Code 1

import pandas as pd
df1 = pd.DataFrame({   'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 4, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df

Code 1 - Error:

ValueError: Index contains duplicate entries, cannot reshape
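
To see which entries trigger the error, one can list the (ID, value) pairs that occur more than once in the melted frame, since pivot needs each index/column pair to be unique. This is only a minimal sketch on the same sample data; the names long_df and dupes are made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [10, 11, 12],
                    'Task1': [1, 1, 1],
                    'Task2': [4, 3, 3],
                    'Task3': [4, 4, 4],
                    'Task4': [6, 5, 6]})

# Long format with the default column names 'variable' and 'value'
long_df = df1.melt(id_vars='ID')

# keep=False flags every row that belongs to a repeated (ID, value) pair
dupes = long_df[long_df.duplicated(subset=['ID', 'value'], keep=False)]

# For this data: the Task2 and Task3 rows of ID 10, both with value 4
print(dupes)
```

If dupes is empty, df.pivot should reshape the melted frame without complaining.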

And according to the docs, pd.pivot_table() is a:

generalization of pivot that can handle duplicate values for one index/column pair.

So I was hoping that pd.pivot_table() would be better suited for this case. Alas, this triggers:

DataError: No numeric types to aggregate

Does anyone know if it's at all possible to obtain a robust way of handling these errors? Am I perhaps only using pd.pivot_table() the wrong way? I've also tried to include aggfunc=None.

I'm at a loss here, so any suggestions would be great! Although I'm hoping for an approach with df.pivot or pd.pivot_table and / or the shortest approach possible.

Complete working code example:

import pandas as pd
df1 = pd.DataFrame({   'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 5, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df

Complete example where both `df.pivot` and `pd.pivot_table` fail:

import pandas as pd
df1 = pd.DataFrame({   'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 4, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
# df = df.pivot(index='ID', columns = 'value', values = 'variable')

df = df.pivot_table(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df

Answer 1:

You can also do this using pd.crosstab (here df is the original wide dataframe, called df1 in the question):

dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='first').ffill(axis=1)
print(df_out)

Output:

val      1      3      4      5      6
ID                                    
10   Task1  Task1  Task2  Task2  Task4
11   Task1  Task2  Task3  Task4  Task4
12   Task1  Task2  Task3  Task3  Task4

Or changing the aggfunc to 'last':

dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='last').ffill(axis=1)
df_out

Output:

val      1      3      4      5      6
ID                                    
10   Task1  Task1  Task3  Task3  Task4
11   Task1  Task2  Task3  Task4  Task4
12   Task1  Task2  Task3  Task3  Task4
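
Since the question asked for pd.pivot_table specifically: the default aggfunc there is 'mean', which cannot be applied to the task-name strings, and that is most likely what raises the error. Passing 'first' or 'last' instead should work the same way as in the crosstab call. A minimal sketch, assuming the wide sample data from the question (df_out is just an illustrative name):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [10, 11, 12],
                    'Task1': [1, 1, 1],
                    'Task2': [4, 3, 3],
                    'Task3': [4, 4, 4],
                    'Task4': [6, 5, 6]})

dfm = df1.melt('ID', value_name='val')

# aggfunc='last' keeps the last task name for a duplicated (ID, val) pair;
# ffill(axis=1) then carries each task forward along the row
df_out = (dfm.pivot_table(index='ID', columns='val',
                          values='variable', aggfunc='last')
             .ffill(axis=1))
print(df_out)
```

Which of 'first' or 'last' is appropriate depends on whether a specialist who finishes two tasks at the same hour should be shown on the earlier or the later one.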

Answer 2:

I'm pretty sure that this is not the best way to do this but it is one way.

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 4, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

# Long format: one row per (ID, task), indexed by (ID, value)
df1 = pd.melt(df, id_vars=['ID'], value_vars=df.columns[1:])
df1['value'] = df1['value'].astype(int)
df1.set_index(['ID','value'], inplace=True)

# Full grid: every ID combined with every hour from 1 to the largest value
df_max_val = df.set_index('ID').max().max()
ids = df['ID'].tolist()*df_max_val
values = list(np.array([[i]*len(set(ids)) for i in range(1, df_max_val+1)]).flatten())
df2 = pd.DataFrame({'ID':ids,
                    'value':values})
df2.set_index(['ID','value'], inplace=True)

# Merge the tasks onto the grid, keep the last task for duplicated (ID, value)
# pairs, forward-fill the gaps within each ID, then pivot to wide format
df3 = df2.merge(df1, left_index=True, right_index=True, how='outer')
df3 = df3.reset_index().drop_duplicates(subset=['ID','value'], keep='last')
# .ffill() replaces fillna(method='ffill'), which newer pandas versions deprecate
df3 = pd.concat([df3[df3['ID']==i].ffill() for i in df3['ID'].unique()])
df3 = df3.pivot(index='ID', columns='value', values='variable')
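
The same idea (one column for every hour, not only the hours that appear in the data) can probably be written more compactly by dropping the duplicated pairs before pivoting and then reindexing the columns. This is a sketch of an alternative rather than a rewrite of the code above; long_df, wide and hours are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'ID': [10, 11, 12],
                   'Task1': [1, 1, 1],
                   'Task2': [4, 3, 3],
                   'Task3': [4, 4, 4],
                   'Task4': [6, 5, 6]})

# Keep only the last task for each duplicated (ID, value) pair so pivot is safe
long_df = df.melt('ID').drop_duplicates(subset=['ID', 'value'], keep='last')
wide = long_df.pivot(index='ID', columns='value', values='variable')

# Add a column for every hour from 1 to the largest value, then fill forward
hours = range(1, int(df.set_index('ID').to_numpy().max()) + 1)
wide = wide.reindex(columns=hours).ffill(axis=1)
print(wide)
```

Hours with no recorded value (hour 2 here) inherit the task from the previous hour through the forward fill.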

Source: https://stackoverflow.com/questions/65974776/how-to-handle-valueerror-index-contains-duplicate-entries-using-df-pivot-or-pd

Tags

python

pandas