Can you tell me when to use these vectorization methods with basic examples?
I see that map
is a Series
method whereas the rest are
Just wanted to point out, as I struggled with this for a bit
def f(x):
if x < 0:
x = 0
elif x > 100000:
x = 100000
return x
df.applymap(f)
df.describe()
df = df.applymap(f)
df.describe()
My understanding:
From the function point of view:
If the function has variables that need to compare within a column/ row, use
apply
.
e.g.: lambda x: x.max()-x.mean()
.
If the function is to be applied to each element:
1> If a column/row is located, use apply
2> If apply to entire dataframe, use applymap
majority = lambda x : x > 17
df2['legal_drinker'] = df2['age'].apply(majority)
def times10(x):
if type(x) is int:
x *= 10
return x
df2.applymap(times10)
@jeremiahbuddha mentioned that apply works on row/columns, while applymap works element-wise. But it seems you can still use apply for element-wise computation....
frame.apply(np.sqrt)
Out[102]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
frame.applymap(np.sqrt)
Out[103]:
b d e
Utah NaN 1.435159 NaN
Ohio 1.098164 0.510594 0.729748
Texas NaN 0.456436 0.697337
Oregon 0.359079 NaN NaN
Based on the answer of cs95
map
is defined on Series ONLYapplymap
is defined on DataFrames ONLYapply
is defined on BOTHgive some examples
In [3]: frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
In [4]: frame
Out[4]:
b d e
Utah 0.129885 -0.475957 -0.207679
Ohio -2.978331 -1.015918 0.784675
Texas -0.256689 -0.226366 2.262588
Oregon 2.605526 1.139105 -0.927518
In [5]: myformat=lambda x: f'{x:.2f}'
In [6]: frame.d.map(myformat)
Out[6]:
Utah -0.48
Ohio -1.02
Texas -0.23
Oregon 1.14
Name: d, dtype: object
In [7]: frame.d.apply(myformat)
Out[7]:
Utah -0.48
Ohio -1.02
Texas -0.23
Oregon 1.14
Name: d, dtype: object
In [8]: frame.applymap(myformat)
Out[8]:
b d e
Utah 0.13 -0.48 -0.21
Ohio -2.98 -1.02 0.78
Texas -0.26 -0.23 2.26
Oregon 2.61 1.14 -0.93
In [9]: frame.apply(lambda x: x.apply(myformat))
Out[9]:
b d e
Utah 0.13 -0.48 -0.21
Ohio -2.98 -1.02 0.78
Texas -0.26 -0.23 2.26
Oregon 2.61 1.14 -0.93
In [10]: myfunc=lambda x: x**2
In [11]: frame.applymap(myfunc)
Out[11]:
b d e
Utah 0.016870 0.226535 0.043131
Ohio 8.870453 1.032089 0.615714
Texas 0.065889 0.051242 5.119305
Oregon 6.788766 1.297560 0.860289
In [12]: frame.apply(myfunc)
Out[12]:
b d e
Utah 0.016870 0.226535 0.043131
Ohio 8.870453 1.032089 0.615714
Texas 0.065889 0.051242 5.119305
Oregon 6.788766 1.297560 0.860289
First major difference: DEFINITION
map
is defined on Series ONLYapplymap
is defined on DataFrames ONLYapply
is defined on BOTHSecond major difference: INPUT ARGUMENT
map
accepts dict
s, Series
, or callableapplymap
and apply
accept callables onlyThird major difference: BEHAVIOR
map
is elementwise for Seriesapplymap
is elementwise for DataFramesapply
also works elementwise but is suited to more complex operations and aggregation. The behaviour and return value depends on the function.Fourth major difference (the most important one): USE CASE
map
is meant for mapping values from one domain to another, so is optimised for performance (e.g., df['A'].map({1:'a', 2:'b', 3:'c'})
)applymap
is good for elementwise transformations across multiple rows/columns (e.g., df[['A', 'B', 'C']].applymap(str.strip)
)apply
is for applying any function that cannot be vectorised (e.g., df['sentences'].apply(nltk.sent_tokenize)
)Footnotes
map
when passed a dictionary/Series will map elements based on the keys in that dictionary/Series. Missing values will be recorded as NaN in the output.
applymap
in more recent versions has been optimised for some operations. You will findapplymap
slightly faster thanapply
in some cases. My suggestion is to test them both and use whatever works better.
map
is optimised for elementwise mappings and transformation. Operations that involve dictionaries or Series will enable pandas to use faster code paths for better performance.Series.apply
returns a scalar for aggregating operations, Series otherwise. Similarly forDataFrame.apply
. Note thatapply
also has fastpaths when called with certain NumPy functions such asmean
,sum
, etc.
Adding to the other answers, in a Series
there are also map and apply.
Apply can make a DataFrame out of a series; however, map will just put a series in every cell of another series, which is probably not what you want.
In [40]: p=pd.Series([1,2,3])
In [41]: p
Out[31]:
0 1
1 2
2 3
dtype: int64
In [42]: p.apply(lambda x: pd.Series([x, x]))
Out[42]:
0 1
0 1 1
1 2 2
2 3 3
In [43]: p.map(lambda x: pd.Series([x, x]))
Out[43]:
0 0 1
1 1
dtype: int64
1 0 2
1 2
dtype: int64
2 0 3
1 3
dtype: int64
dtype: object
Also if I had a function with side effects, such as "connect to a web server", I'd probably use apply
just for the sake of clarity.
series.apply(download_file_for_every_element)
Map
can use not only a function, but also a dictionary or another series. Let's say you want to manipulate permutations.
Take
1 2 3 4 5
2 1 4 5 3
The square of this permutation is
1 2 3 4 5
1 2 5 3 4
You can compute it using map
. Not sure if self-application is documented, but it works in 0.15.1
.
In [39]: p=pd.Series([1,0,3,4,2])
In [40]: p.map(p)
Out[40]:
0 0
1 1
2 4
3 2
4 3
dtype: int64