I just discovered the assign method for pandas dataframes, and it looks nice and very similar to dplyr's mutate in R. However, I've always gotten by with just assigning new columns directly.
The premise of assign is that it returns:
A new DataFrame with the new columns in addition to all the existing columns.
Also, you cannot do anything in place to change the original dataframe:
The callable must not change input DataFrame (though pandas doesn't check it).
On the other hand, df['ln_A'] = np.log(df['A']) modifies the dataframe in place.
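To make the comparison concrete, here is a minimal sketch of both approaches, assuming a small made-up frame with a positive column A (the data and the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column A must be positive for the log.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0]})

# assign returns a new frame; the callable receives the frame and
# returns the new column. df itself gains no column.
with_log = df.assign(ln_A=lambda d: np.log(d["A"]))
cols_after_assign = list(df.columns)

# Direct assignment adds the column to df itself.
df["ln_A"] = np.log(df["A"])
cols_after_setitem = list(df.columns)

print(cols_after_assign)
print(cols_after_setitem)
```

Both routes compute the same values; they differ only in which object ends up holding the new column.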
So is there a reason I should stop using my old method in favour of df.assign?
I think you can try df.assign, but if you are doing memory-intensive work, it is better to stick with what you did before, or to use operations with inplace=True.
The difference concerns whether you wish to modify an existing frame, or create a new frame while maintaining the original frame as it was.
In particular, DataFrame.assign returns a new object that has a copy of the original data with the requested changes ... the original frame remains unchanged.
In your particular case:
>>> df = DataFrame({'A': range(1, 11), 'B': np.random.randn(10)})
Now suppose you wish to create a new frame in which A is everywhere 1 without destroying df. Then you could use .assign:
>>> new_df = df.assign(A=1)
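As a quick check of that claim, the following sketch rebuilds the same frame with explicit imports and inspects both objects afterwards. The helper variable names are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": range(1, 11), "B": np.random.randn(10)})
new_df = df.assign(A=1)

original_a = df["A"].tolist()              # the original column A
new_a = new_df["A"].tolist()               # column A in the new frame
same_object = new_df is df                 # whether assign returned df itself
b_preserved = new_df["B"].equals(df["B"])  # B is carried over unchanged

print(original_a)
print(new_a)
print(same_object, b_preserved)
```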
If you do not wish to maintain the original values, then clearly df["A"] = 1 will be more appropriate. This also explains the speed difference: by necessity, .assign must copy the data, while [...] does not.
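If you want to measure the difference on your own machine, a rough timing sketch could look like the one below. The frame size and repeat count are arbitrary assumptions, and the size of the gap will depend on your pandas version and settings (for example, whether copy-on-write is enabled), so no expected numbers are given:

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical size, chosen only so that a copy is measurable.
n = 200_000
df = pd.DataFrame({"A": np.arange(n), "B": np.random.randn(n)})


def with_assign():
    # Builds and returns a new frame; df is left as it was.
    return df.assign(A=1)


def with_setitem():
    # Overwrites column A on df itself.
    df["A"] = 1


t_assign = timeit.timeit(with_assign, number=20)
t_setitem = timeit.timeit(with_setitem, number=20)
print(f"assign: {t_assign:.4f}s, setitem: {t_setitem:.4f}s")
```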