Replace values in a pandas series via dictionary efficiently

陌清茗 2020-11-22 06:07

How to replace values in a Pandas series s via a dictionary d has been asked and re-asked many times.

The recommended method (1, 2, 3, 4) i

1 answer
  • 2020-11-22 06:47

    One simple solution is to choose a method based on an estimate of how completely the series values are covered by the dictionary keys.

    General case

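    • Use df['A'].map(d) if all values are mapped; or
    • Use df['A'].map(d).fillna(df['A']).astype(int) if more than ~5% of values are mapped. Unmapped values become NaN after map, so fillna restores them from the original column and astype(int) restores the integer dtype.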

    Few values (e.g. < 5%) covered by d

    • Use df['A'].replace(d)

    The "crossover point" of ~5% is specific to the benchmark below and may differ with other data, dictionary sizes, or pandas versions.

    Interestingly, a simple list comprehension generally underperforms map in either scenario.
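The selection rule above can be sketched as a small helper. This is only an illustration under assumptions: the name replace_via_dict and the threshold parameter are hypothetical, the 5% default simply mirrors the crossover seen in this answer's benchmark, and the series and dictionary values are assumed to be integers so that casting back to the original dtype is safe.

```python
import pandas as pd


def replace_via_dict(s, d, threshold=0.05):
    """Hypothetical helper: pick map or replace from the key coverage of d."""
    # Fraction of values in s that appear as keys in d.
    coverage = s.isin(list(d)).mean()
    if coverage == 1.0:
        # Every value is mapped, so a plain map leaves no NaN behind.
        return s.map(d)
    if coverage > threshold:
        # Unmapped values become NaN; restore them and the original dtype.
        return s.map(d).fillna(s).astype(s.dtype)
    # Only a few values are covered: replace did better here in the benchmark.
    return s.replace(d)


s = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
full = replace_via_dict(s, {i: i + 1 for i in range(10)})   # all values mapped
partial = replace_via_dict(s, {0: 100, 1: 101, 2: 102})     # 30% mapped
sparse = replace_via_dict(pd.Series(range(100)), {0: -1})   # 1% mapped
```

Computing the coverage costs an extra pass over the data, so on large inputs it may be cheaper to estimate it from a sample.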

    Benchmarking

    import pandas as pd, numpy as np
    
    df = pd.DataFrame({'A': np.random.randint(0, 1000, 1000000)})
    lst = df['A'].values.tolist()
    
    ##### TEST 1 - Full Map #####
    
    d = {i: i+1 for i in range(1000)}
    
    %timeit df['A'].replace(d)                          # 1.98s
    %timeit df['A'].map(d)                              # 84.3ms
    %timeit [d[i] for i in lst]                         # 134ms
    
    ##### TEST 2 - Partial Map #####
    
    d = {i: i+1 for i in range(10)}
    
    %timeit df['A'].replace(d)                          # 20.1ms
    %timeit df['A'].map(d).fillna(df['A']).astype(int)  # 111ms
    %timeit [d.get(i, i) for i in lst]                  # 243ms
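The %timeit lines above are IPython magics and only run in IPython or Jupyter. A rough equivalent for a plain Python script, using the standard timeit module, might look like the sketch below. The smaller series size, the fixed seed, and the names candidates and timings are assumptions chosen to keep the run short; absolute numbers will differ from those quoted above.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the original benchmark so the script finishes quickly.
rng = np.random.default_rng(0)
s = pd.Series(rng.integers(0, 1000, 20_000))
lst = s.tolist()
d = {i: i + 1 for i in range(1000)}  # full map, as in TEST 1

candidates = {
    "replace": lambda: s.replace(d),
    "map": lambda: s.map(d),
    "list comprehension": lambda: [d[i] for i in lst],
}

# Best of three single runs per candidate, in seconds.
timings = {
    name: min(timeit.repeat(fn, number=1, repeat=3))
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```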
    

    Explanation

    The reason why s.replace is so slow is that it does much more than simply map a dictionary. It deals with some edge cases and arguably rare situations, which typically merit more care in any case.

    This is an excerpt from replace() in pandas\generic.py.

    items = list(compat.iteritems(to_replace))
    keys, values = zip(*items)
    are_mappings = [is_dict_like(v) for v in values]
    
    if any(are_mappings):
        # handling of nested dictionaries
    else:
        to_replace, value = keys, values
    
    return self.replace(to_replace, value, inplace=inplace,
                        limit=limit, regex=regex)
    

    There appear to be many steps involved:

    • Converting dictionary to a list.
    • Iterating through list and checking for nested dictionaries.
    • Feeding an iterator of keys and values into a replace function.
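The steps listed above can be imitated with the public API. The sketch below is not the pandas implementation: d.items() and isinstance stand in for the compat.iteritems and is_dict_like calls in the excerpt, and the sample data is made up. It only shows that, for a flat dictionary, replace(d) ends up as a replace call with parallel key and value lists.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3])
d = {0: 100, 1: 101}

# Step 1: convert the dictionary to a list of (key, value) pairs.
items = list(d.items())
keys, values = zip(*items)

# Step 2: iterate over the values and check for nested dictionaries.
has_nested = any(isinstance(v, dict) for v in values)

# Step 3: for a flat dictionary, pass keys and values on as parallel lists.
via_lists = s.replace(list(keys), list(values))
via_dict = s.replace(d)
```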

    This can be compared to much leaner code from map() in pandas\series.py:

    if isinstance(arg, (dict, Series)):
        if isinstance(arg, dict):
            arg = self._constructor(arg, index=arg.keys())
    
        indexer = arg.index.get_indexer(values)
        new_values = algos.take_1d(arg._values, indexer)
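The lookup in this excerpt can be reproduced by hand with public calls. In the sketch below, algos.take_1d is internal to pandas, so NumPy fancy indexing plus an explicit NaN mask is swapped in as a stand-in; the sample data and the names lookup, taken, and manual are assumptions.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 5])
d = {0: 10, 1: 11, 2: 12}

# Build a Series from the dictionary; its index holds the dictionary keys.
lookup = pd.Series(d)

# Position of each value of s in the lookup index, or -1 if it is absent.
indexer = lookup.index.get_indexer(s.to_numpy())

# Take the mapped values by position. A plain NumPy take would read -1 as
# "last element", so missing positions are masked to NaN afterwards.
taken = lookup.to_numpy().astype(float)[indexer]
taken[indexer == -1] = np.nan

manual = pd.Series(taken, index=s.index)
expected = s.map(d)
```

Both paths do one vectorised index lookup and one take, which is consistent with map being much faster than replace when most values are covered.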
    