Suppose I have a df
which has columns of \'ID\', \'col_1\', \'col_2\'
. And I define a function :
f = lambda x, y : my_function_expres
Returning a list from apply
is a dangerous operation as the resulting object is not guaranteed to be either a Series or a DataFrame. And exceptions might be raised in certain cases. Let's walk through a simple example:
df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
columns=['a', 'b', 'c'])
df
a b c
0 4 0 0
1 2 0 1
2 2 2 2
3 1 2 2
4 3 0 0
There are three possible outcomes with returning a list from apply
1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.
df.apply(lambda x: list(range(2)), axis=1) # returns a Series
0 [0, 1]
1 [0, 1]
2 [0, 1]
3 [0, 1]
4 [0, 1]
dtype: object
2) When the length of the returned list is equal to the number of columns then a DataFrame is returned and each column gets the corresponding value in the list.
df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
a b c
0 0 1 2
1 0 1 2
2 0 1 2
3 0 1 2
4 0 1 2
3) If the length of the returned list equals the number of columns for the first row but has at least one row where the list has a different number of elements than number of columns a ValueError is raised.
i = 0
def f(x):
global i
if i == 0:
i += 1
return list(range(3))
return list(range(4))
df.apply(f, axis=1)
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)
Using apply
with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.
Create larger dataframe
df1 = df.sample(100000, replace=True).reset_index(drop=True)
# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# zip - similar to @Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
@Thomas answer
%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
A interesting question! my answer as below:
import pandas as pd
def sublst(row):
return lst[row['J1']:row['J2']]
df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(sublst,axis=1)
print df
Output:
ID J1 J2
0 1 0 1
1 2 2 4
2 3 3 5
ID J1 J2 J3
0 1 0 1 [a]
1 2 2 4 [c, d]
2 3 3 5 [d, e]
I changed the column name to ID,J1,J2,J3 to ensure ID < J1 < J2 < J3, so the column display in right sequence.
One more brief version:
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print df
lst = ['a','b','c','d','e','f']
df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print df
I'm going to put in a vote for np.vectorize. It allows you to just shoot over x number of columns and not deal with the dataframe in the function, so it's great for functions you don't control or doing something like sending 2 columns and a constant into a function (i.e. col_1, col_2, 'foo').
import numpy as np
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
return mylist[sta:end+1]
#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below
df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])
df
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
Here's an example using apply
on the dataframe, which I am calling with axis = 1
.
Note the difference is that instead of trying to pass two values to the function f
, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.
In [49]: df
Out[49]:
0 1
0 1.000000 0.000000
1 -0.494375 0.570994
2 1.000000 0.000000
3 1.876360 -0.229738
4 1.000000 0.000000
In [50]: def f(x):
....: return x[0] + x[1]
....:
In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]:
0 1.000000
1 0.076619
2 1.000000
3 1.646622
4 1.000000
Depending on your use case, it is sometimes helpful to create a pandas group
object, and then use apply
on the group.
I'm sure this isn't as fast as the solutions using Pandas or Numpy operations, but if you don't want to rewrite your function you can use map. Using the original example data -
import pandas as pd
df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']
def get_sublist(sta,end):
return mylist[sta:end+1]
df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list
We could pass as many arguments as we wanted into the function this way. The output is what we wanted
ID col_1 col_2 col_3
0 1 0 1 [a, b]
1 2 2 4 [c, d, e]
2 3 3 5 [d, e, f]
My example to your questions:
def get_sublist(row, col1, col2):
return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')