How to handle ValueError: Index contains duplicate entries using df.pivot or pd.pivot_table?

Submitted by 我与影子孤独终老i on 2021-02-19 04:08:31

Question


I've got a table showing the accumulated number of hours (dataframe values) different specialists (ID) have taken to complete a sequence of four tasks ['Task1', 'Task2', 'Task3', 'Task4'], like this:

Input:

    ID  Task1   Task2   Task3   Task4
0   10      1       3       4       6
1   11      1       3       4       5
2   12      1       3       4       6

Now I'd like to reshape that dataframe so that I can easily find out which task each specialist was working on after 1 hour, 2 hours, and so on. So the desired output looks like this:

Desired output:

value   1       3       4       5       6
ID                  
10  Task1   Task2   Task3   Task3   Task4
11  Task1   Task2   Task3   Task4   Task4
12  Task1   Task2   Task3   Task3   Task4

With this particular dataframe, I've managed to produce the desired output using pd.melt(), df.pivot() and df.fillna() like this (complete snippet with sample data further down):

What I have tried:

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)

The problem is that this approach is not very robust: it breaks easily on a dataset that would produce (I think) duplicate column names for the same ID. Here's an example where that happens just by changing Task3 in the first row (index 0, ID=10) from 5 to 4:

Code 1

import pandas as pd
df1 = pd.DataFrame({   'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 4, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df

Code 1 - Error:

ValueError: Index contains duplicate entries, cannot reshape
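
One way to see what df.pivot is objecting to is to look for (ID, value) pairs that occur more than once in the melted frame. The following is only a minimal sketch using the Code 1 data; the names long_df and clashes are made up for illustration.

```python
import pandas as pd

# Same data as Code 1: for ID 10, Task2 and Task3 both have the value 4.
df1 = pd.DataFrame({'ID': [10, 11, 12],
                    'Task1': [1, 1, 1],
                    'Task2': [4, 3, 3],
                    'Task3': [4, 4, 4],
                    'Task4': [6, 5, 6]})

# Long format with the default column names 'variable' and 'value'.
long_df = pd.melt(df1, id_vars=['ID'])

# keep=False flags every row whose (ID, value) pair appears more than once.
clashes = long_df[long_df.duplicated(subset=['ID', 'value'], keep=False)]
print(clashes)
```

Each flagged pair would need two different task names in the same output cell, which df.pivot cannot store.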

And according to the docs, pd.pivot_table() is a:

generalization of pivot that can handle duplicate values for one index/column pair.

So I was hoping that pd.pivot_table() would be better suited for this case. Alas, this triggers:

DataError: No numeric types to aggregate

Does anyone know if it's at all possible to obtain a robust way of handling these errors? Am I perhaps only using pd.pivot_table() the wrong way? I've also tried to include aggfunc=None.

I'm at a loss here, so any suggestions would be great! Although I'm hoping for an approach with df.pivot or pd.pivot_table and / or the shortest approach possible.

Complete working code example:

import pandas as pd
df1 = pd.DataFrame({   'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 5, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
df = df.pivot(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df

Complete example where both df.pivot and pd.pivot_table fail:

import pandas as pd
df1 = pd.DataFrame({   'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 4, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

df = pd.melt(df1, id_vars=['ID'], value_vars=df1.columns[1:])
# df = df.pivot(index='ID', columns = 'value', values = 'variable')

df = df.pivot_table(index='ID', columns = 'value', values = 'variable')
df = df.fillna(method = 'ffill', axis = 1)
df

Answer 1:


You can also do this using pd.crosstab (here df is the original wide dataframe, called df1 in the question):

dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='first').ffill(axis=1)
print(df_out)

Output:

val      1      3      4      5      6
ID                                    
10   Task1  Task1  Task2  Task2  Task4
11   Task1  Task2  Task3  Task4  Task4
12   Task1  Task2  Task3  Task3  Task4
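
Since the question asked specifically about pd.pivot_table, the same idea should carry over. The DataError most likely comes from pivot_table's default aggfunc='mean', which cannot average task-name strings (newer pandas versions may word the error differently). Passing an aggregation that works on strings, such as 'first', avoids it. A minimal sketch, with long_df and wide as illustrative names:

```python
import pandas as pd

# Data from the failing example in the question.
df1 = pd.DataFrame({'ID': [10, 11, 12],
                    'Task1': [1, 1, 1],
                    'Task2': [4, 3, 3],
                    'Task3': [4, 4, 4],
                    'Task4': [6, 5, 6]})

long_df = pd.melt(df1, id_vars=['ID'])

# aggfunc='first' keeps the first task name found for each (ID, value) pair
# instead of trying to compute a mean of strings.
wide = long_df.pivot_table(index='ID', columns='value',
                           values='variable', aggfunc='first')

# Carry the most recent task forward into the empty cells of each row.
wide = wide.ffill(axis=1)
print(wide)
```

For this data the result should match the crosstab output above.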

Or changing the aggfunc to 'last':

dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='last').ffill(axis=1)
df_out

Output:

val      1      3      4      5      6
ID                                    
10   Task1  Task1  Task3  Task3  Task4
11   Task1  Task2  Task3  Task4  Task4
12   Task1  Task2  Task3  Task3  Task4
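
If you would rather stay with df.pivot, the same 'first' / 'last' choice can be made explicitly before reshaping by dropping the duplicated (ID, val) rows. This is a sketch rather than a tested recipe for every dataset; deduped is an illustrative name and df1 is the wide frame from the question.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [10, 11, 12],
                    'Task1': [1, 1, 1],
                    'Task2': [4, 3, 3],
                    'Task3': [4, 4, 4],
                    'Task4': [6, 5, 6]})

dfm = df1.melt('ID', value_name='val')

# keep='last' mirrors aggfunc='last'; use keep='first' for the other variant.
deduped = dfm.drop_duplicates(subset=['ID', 'val'], keep='last')

# With unique (ID, val) pairs, plain pivot no longer raises ValueError.
df_out = deduped.pivot(index='ID', columns='val', values='variable').ffill(axis=1)
print(df_out)
```

The explicit drop_duplicates step makes it visible which task wins when two tasks share the same value.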



Answer 2:


I'm pretty sure that this is not the best way to do this, but it is one way.

import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': {0: 10, 1: 11, 2: 12},
                   'Task1': {0: 1, 1: 1, 2: 1},
                   'Task2': {0: 4, 1: 3, 2: 3},
                   'Task3': {0: 4, 1: 4, 2: 4},
                   'Task4': {0: 6, 1: 5, 2: 6}})

# Long format: one row per (ID, task), indexed by (ID, hour value)
df1 = pd.melt(df, id_vars=['ID'], value_vars=df.columns[1:])
df1['value'] = df1['value'].astype(int)
df1.set_index(['ID','value'], inplace=True)

# Full grid: every ID combined with every hour from 1 to the largest value in the data
df_max_val = df.set_index('ID').max().max()
ids = df['ID'].tolist()*df_max_val
values = list(np.array([[i]*len(set(ids)) for i in range(1, df_max_val+1)]).flatten())
df2 = pd.DataFrame({'ID':ids,
                    'value':values})
df2.set_index(['ID','value'], inplace=True)

# Attach the task names to the grid; hours without an entry get NaN
df3 = df2.merge(df1, left_index=True, right_index=True, how='outer')
# Where an ID has several tasks at the same hour, keep only the last one
df3 = df3.reset_index().drop_duplicates(subset=['ID','value'], keep='last')
# Forward-fill within each ID (.ffill() replaces the deprecated fillna(method='ffill'))
df3 = pd.concat([df3[df3['ID']==i].ffill() for i in df3['ID'].unique()])
# Wide format: one row per ID, one column per hour
df3 = df3.pivot(index='ID', columns='value', values='variable')
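
The idea above of having one column for every hour, not only for the hours that occur in the data, can probably be expressed more compactly by reindexing the columns of a pivoted frame. The sketch below is an alternative under that assumption, not the answer's own method; long_df, hours and wide are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'ID': [10, 11, 12],
                   'Task1': [1, 1, 1],
                   'Task2': [4, 3, 3],
                   'Task3': [4, 4, 4],
                   'Task4': [6, 5, 6]})

long_df = df.melt('ID')

# aggfunc='last' plays the role of drop_duplicates(..., keep='last') above.
wide = long_df.pivot_table(index='ID', columns='value',
                           values='variable', aggfunc='last')

# Add a column for every hour from 1 up to the largest value, then fill the gaps.
hours = range(1, int(long_df['value'].max()) + 1)
wide = wide.reindex(columns=hours).ffill(axis=1)
print(wide)
```

Here hour 2 appears as its own column even though no task has the value 2.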


Source: https://stackoverflow.com/questions/65974776/how-to-handle-valueerror-index-contains-duplicate-entries-using-df-pivot-or-pd
