Simple cross-tabulation in pandas

忘掉有多难 2021-02-04 00:01

I stumbled across pandas and it looks ideal for simple calculations that I'd like to do. I have a SAS background and was thinking it'd replace proc freq -- it looks like it'l

2 Answers
  • 2021-02-04 00:44

    Assuming that you have a file called 2010.csv with contents

    category,value
    AB,100.00
    AB,200.00
    AC,150.00
    AD,500.00
    

    Then, using the ability to apply multiple aggregation functions following a groupby, you can say:

    import pandas
    data_2010 = pandas.read_csv("/path/to/2010.csv")
    data_2010.groupby("category").agg([len, sum])
    

    You should get a result that looks something like

              value     
                len  sum
    category            
    AB            2  300
    AC            1  150
    AD            1  500
    

    Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.
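
For readers on a recent pandas release, here is a minimal sketch of the same groupby aggregation. It assumes the CSV rows shown above, inlined through io.StringIO so it runs without a file, and it swaps len and sum for the string names "count" and "sum", since newer pandas versions may warn when Python built-ins are passed as aggregation functions. That swap is my assumption, not part of the original answer.

```python
import io

import pandas as pd

# Same rows as the 2010.csv example, inlined so the snippet is self-contained.
csv_text = """category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
"""

data_2010 = pd.read_csv(io.StringIO(csv_text))

# "count" counts the non-missing values in each group; no value is missing
# here, so it matches the row count that len reports above.
# "sum" is pandas' own sum.
summary = data_2010.groupby("category")["value"].agg(["count", "sum"])
print(summary)
```

Selecting the "value" column before aggregating gives flat column names (count, sum) instead of the two-level header shown in the output above.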

  • 2021-02-04 01:05

    v0.21 answer

    Use pivot_table with the index parameter:

    df.pivot_table(index='category', aggfunc=[len, sum])
    
               len   sum
             value value
    category            
    AB           2   300
    AC           1   150
    AD           1   500
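
A self-contained sketch of the pivot_table call, for anyone who wants to run it directly. The DataFrame is rebuilt inline from the example rows, values="value" is spelled out, and the string names "count" and "sum" stand in for len and sum. Those choices are assumptions of this sketch, not something the answer prescribes.

```python
import pandas as pd

# Example rows from the question, built inline.
df = pd.DataFrame({
    "category": ["AB", "AB", "AC", "AD"],
    "value": [100, 200, 150, 500],
})

# One row per category; one column group per aggregation function.
table = df.pivot_table(index="category", values="value",
                       aggfunc=["count", "sum"])
print(table)
```

The result has two-level columns: the function name on top and the data column underneath.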
    

    <= v0.12

    It is possible to do this using pivot_table for those interested:

    In [8]: df
    Out[8]: 
      category  value
    0       AB    100
    1       AB    200
    2       AC    150
    3       AD    500
    
    In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
    Out[9]: 
                len    sum
              value  value
    category              
    AB            2    300
    AC            1    150
    AD            1    500
    

    Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:

    In [12]: df
    Out[12]: 
      category  value  value2
    0       AB    100       5
    1       AB    200       5
    2       AC    150       5
    3       AD    500       5
    
    In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
    Out[13]: 
                len            sum        
              value  value2  value  value2
    category                              
    AB            2       2    300      10
    AC            1       1    150       5
    AD            1       1    500       5
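
Hierarchically indexed columns can be awkward at first, so here is a small sketch of how to pick pieces out of such a result. It uses the modern index= keyword and string aggregation names rather than the rows= form shown above; the variable names (sums, value_sums, flat) and the "sum_value" style of flattened name are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["AB", "AB", "AC", "AD"],
    "value": [100, 200, 150, 500],
    "value2": [5, 5, 5, 5],
})

table = df.pivot_table(index="category", aggfunc=["count", "sum"])

# The top level selects one aggregation; what remains has plain columns.
sums = table["sum"]

# A (function, column) tuple selects a single column as a Series.
value_sums = table[("sum", "value")]

# Join the two levels into flat names such as "sum_value".
flat = table.copy()
flat.columns = ["_".join(pair) for pair in flat.columns]
print(flat)
```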
    

    The main reason to prefer np.sum over the built-in sum (__builtin__.sum) is that np.sum gives you NA handling. pandas could probably intercept the Python built-in; I'll make a note about that now.
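
The NA-handling difference can be seen on a single Series. This is a rough sketch with made-up values (one of them missing), not the data from the question, and it calls the built-in through the builtins module, which is the Python 3 name for __builtin__.

```python
import builtins

import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
values = pd.Series([100.0, np.nan, 200.0])

# The built-in adds the elements one by one, so a single NaN turns the
# whole total into NaN.
builtin_total = builtins.sum(values)

# Series.sum skips missing values by default (skipna=True).
pandas_total = values.sum()

# np.sum applied to a Series defers to the Series' own sum method,
# so the missing value is skipped here as well.
numpy_total = np.sum(values)

print(builtin_total, pandas_total, numpy_total)
```

Note that this holds for np.sum applied to a pandas object; on a plain NumPy array, np.sum propagates NaN, and np.nansum is the variant that ignores it.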
