Consider the following dataframe:
A B C D
0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.8338
I am going to use a very simple snippet to illustrate the difference:
test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']
The DataFrame looks like this:
id price
0 1 1
1 2 2
2 3 3
3 1 2
4 2 3
5 3 1
6 1 3
7 2 1
8 3 2
There are 3 customer IDs in this table. Each customer made three transactions and paid 1, 2, and 3 dollars, in some order.
Now, I want to find the minimum payment made by each customer. There are two ways of doing it:
Using apply (shown here with the equivalent built-in aggregation min):
grouping.min()
The return looks like this:
id
1 1
2 1
3 1
Name: price, dtype: int64
pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# length is 3
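The snippet above calls the built-in min aggregation directly. As a minimal sketch, assuming the same test DataFrame, an explicit apply call gives the same three-row result. The lambda and the variable names via_apply and via_min are my own illustrative choices:

```python
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
grouping = test.groupby('id')['price']

# Explicit apply: the lambda receives each group's prices as a Series
# and returns one scalar, so the result has one row per id.
via_apply = grouping.apply(lambda s: s.min())

# Built-in aggregation, as used in the snippet above.
via_min = grouping.min()

print(via_apply.tolist())        # [1, 1, 1]
print(via_apply.index.tolist())  # [1, 2, 3]
```

Either way the result is indexed by id, not by the original row labels.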
Using transform:
grouping.transform('min')  # the string 'min' selects the built-in groupby aggregation
The return looks like this:
0 1
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
Name: price, dtype: int64
pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9
Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.
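To check the two shapes side by side, here is a small sketch that rebuilds the same test DataFrame. The names per_group and per_row are just labels I picked for illustration:

```python
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
grouping = test.groupby('id')['price']

per_group = grouping.min()           # one value per id
per_row = grouping.transform('min')  # one value per original row

print(len(per_group), len(per_row))      # 3 9
print(per_group.index.tolist())          # [1, 2, 3]  (the ids)
print(per_row.index.equals(test.index))  # True (same index as test)
```

The transform result shares its index with test, which is what makes it line up row by row with the original data.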
If you want to answer "What is the minimum price paid by each customer?", then the apply method is the more suitable one to choose.
If you want to answer "What is the difference between the amount paid for each transaction vs the minimum payment?", then you want to use transform, because:
test['minimum'] = grouping.transform('min') # creates an extra column filled with each customer's minimum payment
test.price - test.minimum # returns the difference for each row
Apply does not work here simply because it returns a Series of size 3, indexed by id, while the original df has 9 rows. You cannot integrate it back into the original df easily.
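For comparison, here is a rough sketch of what getting the 3-row result back onto the 9 rows involves. It assumes the same test DataFrame and uses Series.map to look up each row's id in the aggregated Series. The column names minimum_via_map and diff are hypothetical, added only for this illustration:

```python
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
grouping = test.groupby('id')['price']

# Aggregated result: 3 rows, indexed by id.
per_group_min = grouping.min()

# Extra step: look up each row's id in the aggregated Series.
test['minimum_via_map'] = test['id'].map(per_group_min)

# transform does the same alignment in a single call.
test['minimum'] = grouping.transform('min')

test['diff'] = test['price'] - test['minimum']
print(test['diff'].tolist())  # [0, 1, 2, 1, 2, 0, 2, 0, 1]
```

Both columns end up identical, so this is possible, but transform saves the lookup step.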