Pivot a pandas DataFrame to be the correct format: `DataError: No numeric types to aggregate`

青春惊慌失措 2020-12-21 06:47

Here is a pandas DataFrame I would like to manipulate:

import pandas as pd

data = {"grouping": ["item1", "item1", "item1", "item2", "item2", "


        
4 answers
  • 2020-12-21 07:06

    Use set_index and unstack:

    df = df.set_index(['grouping','labels']).unstack().rename_axis(None)
    df.columns = df.columns.droplevel()
    print(df)
    

    Output:

    labels  A    B    C     D
    item1   5    1    8  None
    item2   3  731  189     9
    
  • 2020-12-21 07:15

    There are four idiomatic pandas ways to do this.

    • No duplicates among grouping columns. Does not require aggregation
      • pivot
      • set_index
    • Duplicates among grouping columns. Does require aggregation
      • pivot_table
      • groupby
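
The split above can be checked with a small sketch. The frame `dup` and the helper `reshapes_ok` are hypothetical names made up for illustration; the data deliberately repeats one (grouping, labels) pair. Under that assumption, the two non-aggregating approaches should raise, while the two aggregating ones combine the repeated rows (sum is an arbitrary choice here).

```python
import pandas as pd

# Hypothetical data: the (item1, A) pair appears twice.
dup = pd.DataFrame({
    "grouping": ["item1", "item1", "item2"],
    "labels": ["A", "A", "B"],
    "count": [5, 2, 3],
})


def reshapes_ok(func):
    """Return True if func() runs, False if pandas raises ValueError."""
    try:
        func()
        return True
    except ValueError:
        return False


# Non-aggregating approaches cannot place two values in one cell.
pivot_ok = reshapes_ok(
    lambda: dup.pivot(index="grouping", columns="labels", values="count")
)
set_index_ok = reshapes_ok(
    lambda: dup.set_index(["grouping", "labels"])["count"].unstack()
)

# Aggregating approaches combine the repeated rows.
via_pivot_table = dup.pivot_table(
    index="grouping", columns="labels", values="count", aggfunc="sum"
)
via_groupby = dup.groupby(["grouping", "labels"])["count"].sum().unstack()

print(pivot_ok, set_index_ok)
print(via_pivot_table)
```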

    pivot

    # Keyword arguments are required by pandas 2.0 and later
    df.pivot(index='grouping', columns='labels', values='count')
    

    set_index

    df.set_index(['grouping', 'labels'])['count'].unstack()
    

    pivot_table

    df.pivot_table('count', 'grouping', 'labels')
    

    groupby

    df.groupby(['grouping', 'labels'])['count'].sum().unstack()
    

    All yield

    labels      A      B      C    D
    grouping                        
    item1     5.0    1.0    8.0  NaN
    item2     3.0  731.0  189.0  9.0
    


    With the groupby, set_index, or pivot_table approach, you can easily fill in missing values with fill_value=0

    df.pivot_table('count', 'grouping', 'labels', fill_value=0)
    
    df.groupby(['grouping', 'labels'])['count'].sum().unstack(fill_value=0)
    
    df.set_index(['grouping', 'labels'])['count'].unstack(fill_value=0)
    

    All yield

    labels    A    B    C  D
    grouping                
    item1     5    1    8  0
    item2     3  731  189  9
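
As a self-contained check, the sketch below rebuilds the sample data shown in this thread and compares the three `fill_value=0` variants. The variable names are illustrative. The resulting dtype (integer or float) of the `pivot_table` variant may differ between pandas versions, so only the values are compared.

```python
import pandas as pd

# Sample data as given in this thread.
data = {"grouping": ["item1", "item1", "item1", "item2", "item2", "item2", "item2"],
        "labels": ["A", "B", "C", "A", "B", "C", "D"],
        "count": [5, 1, 8, 3, 731, 189, 9]}
df = pd.DataFrame(data)

# Each variant fills the missing (item1, D) cell with 0.
via_pivot_table = df.pivot_table(
    index="grouping", columns="labels", values="count", fill_value=0
)
via_groupby = df.groupby(["grouping", "labels"])["count"].sum().unstack(fill_value=0)
via_set_index = df.set_index(["grouping", "labels"])["count"].unstack(fill_value=0)

print(via_set_index)
```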
    

    Additional thoughts on groupby

    Because the data has no duplicates, no real aggregation is required. If we still want to use groupby, we can minimize the cost of the implicit aggregation by choosing a cheaper aggregator.

    df.groupby(['grouping', 'labels'])['count'].max().unstack()
    

    or

    df.groupby(['grouping', 'labels'])['count'].first().unstack()
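
A quick sketch, using the sample data from this thread, suggests that the choice of aggregator does not change the result when every (grouping, labels) pair occurs at most once. The variable names are illustrative, and no claim is made here about relative speed.

```python
import pandas as pd

# Sample data as given in this thread: no repeated (grouping, labels) pair.
data = {"grouping": ["item1", "item1", "item1", "item2", "item2", "item2", "item2"],
        "labels": ["A", "B", "C", "A", "B", "C", "D"],
        "count": [5, 1, 8, 3, 731, 189, 9]}
df = pd.DataFrame(data)

grouped = df.groupby(["grouping", "labels"])["count"]

# With one row per group, each aggregator just returns that row's value.
by_sum = grouped.sum().unstack()
by_max = grouped.max().unstack()
by_first = grouped.first().unstack()

# Raises AssertionError if the frames differ (dtype differences are ignored).
pd.testing.assert_frame_equal(by_sum, by_max, check_dtype=False)
pd.testing.assert_frame_equal(by_sum, by_first, check_dtype=False)
```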
    

  • 2020-12-21 07:18

    Try:

    In [1]: import pandas as pd
       ...: 
       ...: data = {"grouping": ["item1", "item1", "item1", "item2", "item2", "item2", "item2"],
       ...:         "labels": ["A", "B", "C", "A", "B", "C", "D"],
       ...:         "count": [5, 1, 8, 3, 731, 189, 9]}
       ...: 
    In [2]: df = pd.DataFrame(data)
    In [3]: df.pivot_table(index="grouping",columns="labels")
    
    Out[3]: 
                 count              
        labels       A    B    C   D
        grouping                    
        item1        5    1    8 NaN
        item2        3  731  189   9
    
  • 2020-12-21 07:19

    You put labels in the index, but you want it in the columns:

    >>> df.pivot_table(index='grouping', columns='labels')
             count                   
    labels       A      B      C    D
    grouping                         
    item1      5.0    1.0    8.0  NaN
    item2      3.0  731.0  189.0  9.0
    

    Note that this makes the columns a MultiIndex. If you don't want that, explicitly pass values: df.pivot_table(index='grouping', columns='labels', values='count').
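
The difference can be seen by inspecting the column index. The sketch below uses the sample data from this thread; the variable names are illustrative. It also shows `droplevel` as one alternative way to flatten the columns after the fact.

```python
import pandas as pd

data = {"grouping": ["item1", "item1", "item1", "item2", "item2", "item2", "item2"],
        "labels": ["A", "B", "C", "A", "B", "C", "D"],
        "count": [5, 1, 8, 3, 731, 189, 9]}
df = pd.DataFrame(data)

# Without values=, the value column name becomes an extra column level.
wide_multi = df.pivot_table(index="grouping", columns="labels")

# With values=, the columns are a single-level index of labels.
wide_flat = df.pivot_table(index="grouping", columns="labels", values="count")

# Alternative: drop the outer level from the MultiIndex result.
flattened = wide_multi.droplevel(0, axis=1)

print(wide_multi.columns.nlevels, wide_flat.columns.nlevels)
```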

    Also, note that the kind of reshape you seem to be looking for will only be possible if each combination of grouping and label has exactly one or zero values. If any combination occurs more than once, you need to decide how to aggregate them (e.g., by summing the matching values).
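
One way to make that decision explicit is to count the rows per combination first and then pass the chosen aggregator as `aggfunc`. The frame `dup` below is hypothetical data made up for illustration, and sum and max are just two possible choices.

```python
import pandas as pd

# Hypothetical data in which (item1, A) occurs twice.
dup = pd.DataFrame({
    "grouping": ["item1", "item1", "item1", "item2"],
    "labels": ["A", "A", "B", "A"],
    "count": [5, 2, 1, 3],
})

# Count how many rows share each (grouping, labels) combination.
pair_sizes = dup.groupby(["grouping", "labels"]).size()
repeated = pair_sizes[pair_sizes > 1]

# Two possible aggregation choices for the repeated combination.
total = dup.pivot_table(index="grouping", columns="labels",
                        values="count", aggfunc="sum")
largest = dup.pivot_table(index="grouping", columns="labels",
                          values="count", aggfunc="max")

print(repeated)
print(total)
```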
