Simple cross-tabulation in pandas


I stumbled across pandas and it looks ideal for simple calculations that I'd like to do. I have a SAS background and was thinking it'd replace proc freq -- it looks like it'l


2 Answers

滥情空心

2021-02-04 00:44
Assuming that you have a file called 2010.csv with these contents:
```
category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
```
Then, using the ability to apply multiple aggregation functions following a groupby, you can say:
```
import pandas
data_2010 = pandas.read_csv("/path/to/2010.csv")
data_2010.groupby("category").agg([len, sum])
```
You should get a result that looks something like this:
```
          value     
            len  sum
category            
AB            2  300
AC            1  150
AD            1  500
```
Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.
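
If you want to try this without a file on disk, here is a minimal sketch of the same aggregation. It assumes a reasonably recent pandas; the in-memory `csv_text` and the string aggregation names `"size"` and `"sum"` are my substitutions for the file path and for `len`/`sum`, not part of the answer above.

```python
import io

import pandas as pd

# Same rows as the 2010.csv example, held in memory so no file path is needed.
csv_text = """category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
"""
data_2010 = pd.read_csv(io.StringIO(csv_text))

# "size" counts rows per group (like len); "sum" uses pandas' own group sum.
summary = data_2010.groupby("category")["value"].agg(["size", "sum"])
print(summary)
```

Selecting the `value` column before aggregating gives flat column names (`size`, `sum`) instead of the two-level header shown above.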

广开言路

2021-02-04 01:05

v0.21 answer

Use pivot_table with the index parameter:

```
df.pivot_table(index='category', aggfunc=[len, sum])

           len   sum
         value value
category            
AB           2   300
AC           1   150
AD           1   500
```
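
For anyone who wants to run it end to end, a small self-contained sketch follows. The DataFrame is rebuilt by hand from the question's data, and I've swapped in the string names `"count"` and `"sum"` for `len` and `sum`; note that `"count"` only counts non-NA values, so it matches `len` here only because nothing is missing.

```python
import pandas as pd

# Rebuild the example frame by hand (same rows as in the question).
df = pd.DataFrame(
    {
        "category": ["AB", "AB", "AC", "AD"],
        "value": [100, 200, 150, 500],
    }
)

# One row per category; the columns become (aggregation, data column) pairs.
table = df.pivot_table(index="category", aggfunc=["count", "sum"])
print(table)
```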

<= v0.12

It is possible to do this using pivot_table for those interested:

```
In [8]: df
Out[8]: 
  category  value
0       AB    100
1       AB    200
2       AC    150
3       AD    500

In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]: 
            len    sum
          value  value
category              
AB            2    300
AC            1    150
AD            1    500
```

Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:

```
In [12]: df
Out[12]: 
  category  value  value2
0       AB    100       5
1       AB    200       5
2       AC    150       5
3       AD    500       5

In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]: 
            len            sum        
          value  value2  value  value2
category                              
AB            2       2    300      10
AC            1       1    150       5
AD            1       1    500       5
```
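
Since the columns are hierarchical, pulling values back out takes either the outer label or the full pair. A rough sketch, assuming a recent pandas (so `index=` rather than `rows=`) and using `"count"`/`"sum"` as stand-ins for `len`/`np.sum`:

```python
import pandas as pd

# Two data columns, as in the example above.
df = pd.DataFrame(
    {
        "category": ["AB", "AB", "AC", "AD"],
        "value": [100, 200, 150, 500],
        "value2": [5, 5, 5, 5],
    }
)

table = df.pivot_table(index="category", aggfunc=["count", "sum"])

# Selecting the outer level gives a plain frame with one column per data column.
sums = table["sum"]

# A full (aggregation, column) tuple addresses a single cell.
ab_value2 = table.loc["AB", ("sum", "value2")]

print(sums)
print(ab_value2)
```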

The main reason to prefer np.sum over the built-in sum (__builtin__.sum) is that you get NA-handling from np.sum. pandas could probably intercept the Python built-in; I'll make a note about that now.
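
To see the NA-handling difference on a bare Series, here is a small sketch. The Series `s` is made up for illustration, and the behaviour shown is what I'd expect on current pandas/NumPy, where `np.sum` on a Series ends up calling the Series' own `sum` method.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
s = pd.Series([1.0, np.nan, 2.0])

# The Python built-in just adds the elements one by one, so NaN propagates.
builtin_total = sum(s)

# np.sum on a Series delegates to Series.sum, which skips NaN by default.
numpy_total = np.sum(s)

print(math.isnan(builtin_total))
print(numpy_total)
```

Note that this applies to pandas objects; `np.sum` on a plain NumPy array containing NaN still returns NaN.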
