I am using python2.7 and pandas 0.11.0.
I try to fill a column of a dataframe using DataFrame.apply(func). The func() function is supposed to return a numpy array (1
If you try to return multiple values from the function that is passed to apply
, and the DataFrame you call the apply
on has the same number of item along the axis (in this case columns) as the number of values you returned, Pandas will create a DataFrame from the return values with the same labels as the original DataFrame. You can see this if you just do:
>>> def test(row):
return [1, 2, 3]
>>> df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
A B C
0 1 2 3
1 1 2 3
2 1 2 3
3 1 2 3
And that is why you get the error, since you cannot assign a DataFrame to DataFrame column.
If you return any other number of values, it will return just a series object, that can be assigned:
>>> def test(row):
return [1, 2]
>>> df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df.apply(test, axis=1)
0 [1, 2]
1 [1, 2]
2 [1, 2]
3 [1, 2]
>>> df['D'] = df.apply(test, axis=1)
>>> df
A B C D
0 0.333535 0.209745 -0.972413 [1, 2]
1 0.469590 0.107491 -1.248670 [1, 2]
2 0.234444 0.093290 -0.853348 [1, 2]
3 1.021356 0.092704 -0.406727 [1, 2]
I'm not sure why Pandas does this, and why it does it only when the return value is a list
or an ndarray
, since it won't do it if you return a tuple
:
>>> def test(row):
return (1, 2, 3)
>>> df= pd.DataFrame(np.random.randn(4, 3), columns=list('ABC'))
>>> df['D'] = df.apply(test, axis=1)
>>> df
A B C D
0 0.121136 0.541198 -0.281972 (1, 2, 3)
1 0.569091 0.944344 0.861057 (1, 2, 3)
2 -1.742484 -0.077317 0.181656 (1, 2, 3)
3 -1.541244 0.174428 0.660123 (1, 2, 3)