How to make a pandas crosstab with percentages?

时光说笑 2021-01-30 01:44

Given a dataframe with different categorical variables, how do I return a cross-tabulation with percentages instead of frequencies?

df = pd.DataFrame({'A' : [


        
6 Answers
  • 2021-01-30 01:55

    If you want each cell as a percentage of the grand total, divide by len(df) instead of the row sum:

    pd.crosstab(df.A, df.B).apply(lambda r: r/len(df), axis=1)
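
A minimal sketch of the share-of-total calculation. The question's original frame is truncated, so the data below is hypothetical. It assumes there are no missing values in A or B, since rows with a missing key are not counted by crosstab and len(df) would then overstate the total.

```python
import pandas as pd

# Hypothetical data standing in for the truncated frame in the question.
df = pd.DataFrame({
    "A": ["one", "one", "two", "three"],
    "B": ["x", "y", "x", "y"],
})

counts = pd.crosstab(df.A, df.B)

# Dividing the whole table by the number of rows turns each count
# into its share of the grand total; no apply is needed for this.
share_of_total = counts / len(df)

# Built-in equivalent, available from pandas 0.18.1 onwards.
share_builtin = pd.crosstab(df.A, df.B, normalize="all")

print(share_of_total)
```

Both tables should agree cell for cell, and all cells together should sum to 1.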
    
  • 2021-01-30 01:56
    pd.crosstab(df.A, df.B).apply(lambda r: r/r.sum(), axis=1)
    

    The lambda divides each row by its own sum (row / row.sum()), and apply with axis=1 runs it once per row.

    (If doing this in Python 2, you should use from __future__ import division to make sure division always returns a float.)

  • 2021-01-30 02:00

    From Pandas 0.18.1 onwards, there's a normalize option:

    In [1]: pd.crosstab(df.A,df.B, normalize='index')
    Out[1]:
    
    B              A           B           C
    A           
    one     0.333333    0.333333    0.333333
    three   0.333333    0.333333    0.333333
    two     0.333333    0.333333    0.333333
    

    You can normalise over 'all' (the grand total), 'index' (each row), or 'columns' (each column).

    More details are available in the documentation.
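
To make the three modes concrete, here is a small sketch on hypothetical data (the column values are made up for illustration, not taken from the question):

```python
import pandas as pd

# Hypothetical data; any two categorical columns would do.
df = pd.DataFrame({
    "A": ["one", "one", "one", "two", "two", "three"],
    "B": ["x", "y", "y", "x", "x", "y"],
})

by_row = pd.crosstab(df.A, df.B, normalize="index")    # each row sums to 1
by_col = pd.crosstab(df.A, df.B, normalize="columns")  # each column sums to 1
overall = pd.crosstab(df.A, df.B, normalize="all")     # all cells sum to 1

print(by_row)
print(by_col)
print(overall)
```

Which mode fits depends on the question being asked: 'index' answers "how is B distributed within each A?", while 'columns' answers the reverse.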

  • Normalizing over the index is enough: pass normalize="index" to pd.crosstab().

  • 2021-01-30 02:13

    We can show it as percentages by multiplying by 100:

    pd.crosstab(df.A,df.B, normalize='index')\
        .round(4)*100
    
    B          A      B      C
    A                         
    one    33.33  33.33  33.33
    three  33.33  33.33  33.33
    two    33.33  33.33  33.33
    

    Where I've rounded for convenience.
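
A variation on the same idea, as a sketch on hypothetical data: scaling to 100 first and rounding last keeps the rounding as the final step, so multiplication cannot reintroduce floating-point noise after it. The string formatting at the end is optional and intended for display only.

```python
import pandas as pd

# Hypothetical data; the question's original frame is truncated.
df = pd.DataFrame({
    "A": ["one", "one", "one", "two", "two", "three"],
    "B": ["x", "y", "y", "x", "x", "y"],
})

# Row-wise proportions, scaled to percent, then rounded to 2 decimals.
pct = pd.crosstab(df.A, df.B, normalize="index").mul(100).round(2)

# Optional: strings with a percent sign, for display only
# (the result is no longer numeric).
pct_labels = pct.apply(lambda col: col.map("{:.2f}%".format))

print(pct)
print(pct_labels)
```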

  • 2021-01-30 02:22

    Another option is to use div rather than apply:

    In [11]: res = pd.crosstab(df.A, df.B)
    

    Divide by the sum over the index:

    In [12]: res.sum(axis=1)
    Out[12]: 
    A
    one      12
    three     6
    two       6
    dtype: int64
    

    As in the apply-based answer above, you need to guard against integer division (I use astype('float')):

    In [13]: res.astype('float').div(res.sum(axis=1), axis=0)
    Out[13]: 
    B             A         B         C
    A                                  
    one    0.333333  0.333333  0.333333
    three  0.333333  0.333333  0.333333
    two    0.333333  0.333333  0.333333
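
The div approach can be checked against the other two methods with a short sketch. The data is hypothetical, and the code assumes Python 3, where division of integer counts already yields floats, so the astype('float') step is left out.

```python
import pandas as pd

# Hypothetical data; the question's original frame is truncated.
df = pd.DataFrame({
    "A": ["one", "one", "one", "two", "two", "three"],
    "B": ["x", "y", "y", "x", "x", "y"],
})

res = pd.crosstab(df.A, df.B)
row_totals = res.sum(axis=1)

# axis=0 aligns row_totals with the table's index, so every row is
# divided by its own total in a single vectorised operation.
by_div = res.div(row_totals, axis=0)

# The row-wise apply and the built-in normalize give the same numbers.
by_apply = res.apply(lambda r: r / r.sum(), axis=1)
by_normalize = pd.crosstab(df.A, df.B, normalize="index")

print(by_div)
```

On larger tables div avoids calling a Python function per row, which is the main reason to prefer it over apply.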
    