How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times. The recommended method (1, 2, 3, 4) is to use s.replace(d). However, as the benchmarks below show, s.replace can be much slower than s.map.
One trivial solution is to choose a method dependent on an estimate of how completely values are covered by dictionary keys.
General case
df['A'].map(d) if all values are mapped, or df['A'].map(d).fillna(df['A']).astype(int) if >5% of values are mapped.

Few, e.g. < 5%, values in d
df['A'].replace(d)

The "crossover point" of ~5% is specific to the Benchmarking below. Interestingly, a simple list comprehension generally underperforms map in either scenario.
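As a rough sketch of this coverage-based choice (the helper name remap_series and the 5% default threshold are illustrative, not part of any pandas API):

import pandas as pd

def remap_series(s, d, threshold=0.05):
    # Fraction of values covered by dictionary keys (exact here; a sample of s
    # could be used instead for a cheaper estimate)
    coverage = s.isin(d.keys()).mean()
    if coverage == 1:
        return s.map(d)                            # every value is covered
    if coverage > threshold:
        return s.map(d).fillna(s).astype(s.dtype)  # map what is covered, keep the rest
    return s.replace(d)                            # only a few values to touch

# e.g. df['A'] = remap_series(df['A'], d)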
Benchmarking
import pandas as pd, numpy as np
df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()
##### TEST 1 - Full Map #####
d = {i: i+1 for i in range(1000)}
%timeit df['A'].replace(d) # 1.98s
%timeit df['A'].map(d) # 84.3ms
%timeit [d[i] for i in lst] # 134ms
##### TEST 2 - Partial Map #####
d = {i: i+1 for i in range(10)}
%timeit df['A'].replace(d) # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int) # 111ms
%timeit [d.get(i, i) for i in lst] # 243ms
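As a quick sanity check (not part of the timings above), the three partial-map approaches can be verified to produce the same values:

out_replace = df['A'].replace(d)
out_map = df['A'].map(d).fillna(df['A']).astype(int)
out_listcomp = [d.get(i, i) for i in lst]
assert (out_replace == out_map).all()
assert (out_replace.to_numpy() == out_listcomp).all()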
Explanation
The reason why s.replace is so slow is that it does much more than simply map a dictionary. It deals with some edge cases and arguably rare situations, which typically merit more care in any case.
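For example, replace accepts nested, column-keyed dictionaries on a DataFrame, which map does not attempt (a small illustration, separate from the benchmark above):

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
# Nested dict: replace 1 with 100 in column 'A' only; column 'B' is left untouched
print(df2.replace({'A': {1: 100}}))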
This is an excerpt from replace() in pandas\generic.py:
items = list(compat.iteritems(to_replace))
keys, values = zip(*items)
are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries
else:
    to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)
There appear to be many steps involved. This can be compared to much leaner code from map() in pandas\series.py:
if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
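Roughly, and using only public pandas/NumPy API for illustration, that lookup amounts to a hash-based get_indexer followed by a single vectorized take (this sketch only reproduces the full-coverage case):

mapper = pd.Series(d)                                   # index = dict keys, values = dict values
indexer = mapper.index.get_indexer(df['A'].to_numpy())  # position of each element's key; -1 if absent
new_values = mapper.to_numpy().take(indexer)            # one vectorized gather
# Note: unlike pandas' internal take_1d, NumPy's take treats -1 as "last element"
# rather than missing, so this only matches map when every value is covered by d.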