I stumbled across pandas and it looks ideal for the simple calculations I'd like to do. I have a SAS background and was thinking pandas could replace proc freq.
Assuming you have a file called 2010.csv with these contents:
category,value
AB,100.00
AB,200.00
AC,150.00
AD,500.00
Then, using the ability to apply multiple aggregation functions following a groupby, you can say:
import pandas

data_2010 = pandas.read_csv("/path/to/2010.csv")
# apply both len (row count) and sum to each category group
data_2010.groupby("category").agg([len, sum])
You should get a result that looks something like
         value
           len  sum
category
AB           2  300
AC           1  150
AD           1  500
Note that Wes will likely come by to point out that sum is optimized and that you should probably use np.sum.
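For example, the same aggregation with np.sum (a minimal sketch, reusing the data_2010 frame from above):

import numpy as np

# np.sum defers to pandas' NA-aware sum, so missing values in
# "value" are skipped rather than propagated
data_2010.groupby("category").agg([len, np.sum])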
>= v0.21

Use pivot_table with the index parameter:

df.pivot_table(index='category', aggfunc=[len, sum])
            len    sum
          value  value
category
AB            2    300
AC            1    150
AD            1    500
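As an aside (not part of the original answer), aggfunc also accepts function names as strings; note that 'count' ignores NA values, whereas len counts all rows:

# equivalent spelling with string function names
df.pivot_table(index='category', aggfunc=['count', 'sum'])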
<= v0.12
For those interested, it is possible to do this with pivot_table (np below is numpy, imported as np):
In [8]: df
Out[8]:
  category  value
0       AB    100
1       AB    200
2       AC    150
3       AD    500
In [9]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[9]:
            len    sum
          value  value
category
AB            2    300
AC            1    150
AD            1    500
Note that the result's columns are hierarchically indexed. If you had multiple data columns, you would get a result like this:
In [12]: df
Out[12]:
  category  value  value2
0       AB    100       5
1       AB    200       5
2       AC    150       5
3       AD    500       5
In [13]: df.pivot_table(rows='category', aggfunc=[len, np.sum])
Out[13]:
            len           sum
          value value2  value value2
category
AB            2      2    300     10
AC            1      1    150      5
AD            1      1    500      5
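Because the columns form a MultiIndex, you can pull a single piece out of the result with ordinary indexing; a quick sketch, written with the modern index= keyword (swap in rows= on <= v0.12):

result = df.pivot_table(index='category', aggfunc=[len, np.sum])
result['sum']              # the sum half: one column per data column
result[('sum', 'value2')]  # a single Series: per-category sums of value2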
The main reason to prefer np.sum over the built-in __builtin__.sum is that you get NA handling from np.sum. pandas could probably intercept the Python built-in too; I'll make a note about that now.
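A quick illustration of that NA handling (my example, not part of the original comment):

import numpy as np
import pandas as pd

s = pd.Series([100.0, np.nan, 150.0])
sum(s)     # nan   -- the built-in iterates, and NaN propagates
np.sum(s)  # 250.0 -- np.sum defers to Series.sum, which skips NaN by default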