How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times. The recommended method (1, 2, 3, 4) is to use s.replace(d). However, as the benchmarks below show, s.replace can be much slower than s.map.
One trivial solution is to choose a method dependent on an estimate of how completely values are covered by dictionary keys.
General case
df['A'].map(d) if all values are mapped, or df['A'].map(d).fillna(df['A']).astype(int) if >5% of values are mapped.

Few, e.g. < 5%, values in d
df['A'].replace(d)

The "crossover point" of ~5% is specific to the Benchmarking below. Interestingly, a simple list comprehension generally underperforms map in either scenario.
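As a rough sketch of this coverage-based choice (the helper name remap_series and the 5% default threshold are illustrative, not part of any pandas API):

import pandas as pd

def remap_series(s, d, threshold=0.05):
    # Fraction of values covered by dictionary keys (exact here; a sample of s
    # could be used instead for a cheaper estimate)
    coverage = s.isin(d.keys()).mean()
    if coverage == 1:
        return s.map(d)                            # every value is covered
    if coverage > threshold:
        return s.map(d).fillna(s).astype(s.dtype)  # map what is covered, keep the rest
    return s.replace(d)                            # only a few values to touch

# e.g. df['A'] = remap_series(df['A'], d)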
Benchmarking
import pandas as pd, numpy as np
df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
lst = df['A'].values.tolist()
##### TEST 1 - Full Map #####
d = {i: i+1 for i in range(1000)}
%timeit df['A'].replace(d) # 1.98s
%timeit df['A'].map(d) # 84.3ms
%timeit [d[i] for i in lst] # 134ms
##### TEST 2 - Partial Map #####
d = {i: i+1 for i in range(10)}
%timeit df['A'].replace(d) # 20.1ms
%timeit df['A'].map(d).fillna(df['A']).astype(int) # 111ms
%timeit [d.get(i, i) for i in lst] # 243ms
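As a quick sanity check (not part of the timings above), the three partial-map approaches can be verified to produce the same values:

out_replace = df['A'].replace(d)
out_map = df['A'].map(d).fillna(df['A']).astype(int)
out_listcomp = [d.get(i, i) for i in lst]
assert (out_replace == out_map).all()
assert (out_replace.to_numpy() == out_listcomp).all()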
Explanation
The reason why s.replace is so slow is that it does much more than simply map a dictionary. It deals with some edge cases and arguably rare situations, which typically merit more care in any case.
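For example, replace accepts nested, column-keyed dictionaries on a DataFrame, which map does not attempt (a small illustration, separate from the benchmark above):

df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 2, 3]})
# Nested dict: replace 1 with 100 in column 'A' only; column 'B' is left untouched
print(df2.replace({'A': {1: 100}}))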
This is an excerpt from replace() in pandas\generic.py:
items = list(compat.iteritems(to_replace))
keys, values = zip(*items)
are_mappings = [is_dict_like(v) for v in values]

if any(are_mappings):
    # handling of nested dictionaries
else:
    to_replace, value = keys, values

return self.replace(to_replace, value, inplace=inplace,
                    limit=limit, regex=regex)
There appear to be many steps involved. This can be compared to much leaner code from map() in pandas\series.py:
if isinstance(arg, (dict, Series)):
    if isinstance(arg, dict):
        arg = self._constructor(arg, index=arg.keys())

    indexer = arg.index.get_indexer(values)
    new_values = algos.take_1d(arg._values, indexer)
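Roughly, and using only public pandas/NumPy API for illustration, that lookup amounts to a hash-based get_indexer followed by a single vectorized take (this sketch only reproduces the full-coverage case):

mapper = pd.Series(d)                                   # index = dict keys, values = dict values
indexer = mapper.index.get_indexer(df['A'].to_numpy())  # position of each element's key; -1 if absent
new_values = mapper.to_numpy().take(indexer)            # one vectorized gather
# Note: unlike pandas' internal take_1d, NumPy's take treats -1 as "last element"
# rather than missing, so this only matches map when every value is covered by d.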