Remap values in pandas column with a dict

囚心锁ツ 2020-11-21 05:14

I have a dictionary which looks like this: di = {1: "A", 2: "B"}

I would like to apply it to the "col1" column of a dataframe, remapping the values that match its keys.

10 Answers
  • 2020-11-21 05:28

    map is faster than replace (see @JohnE's solution), but be careful with non-exhaustive mappings in which you intend to map specific values to NaN. In that case you must mask the Series passed to .fillna; otherwise .fillna restores the original values and undoes the mapping to NaN.

    import pandas as pd
    import numpy as np
    
    d = {'m': 'Male', 'f': 'Female', 'missing': np.nan}
    df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})
    

    keep_nan = [k for k,v in d.items() if pd.isnull(v)]
    s = df['gender']
    
    df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))
    

        gender  mapped
    0        m    Male
    1        f  Female
    2  missing     NaN
    3     Male    Male
    4        U       U
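
    To make the pitfall concrete, here is a minimal sketch that compares a plain .fillna(s) with the masked version on the same data. The names naive and masked are only illustrative.

```python
import numpy as np
import pandas as pd

d = {'m': 'Male', 'f': 'Female', 'missing': np.nan}
s = pd.Series(['m', 'f', 'missing', 'Male', 'U'])

# Naive fallback: every NaN produced by map() is refilled from s,
# including the one that 'missing' was deliberately mapped to.
naive = s.map(d).fillna(s)

# Masked fallback: blank out the keys that should stay NaN before filling.
keep_nan = [k for k, v in d.items() if pd.isnull(v)]
masked = s.map(d).fillna(s.mask(s.isin(keep_nan)))

print(naive.tolist())
print(masked.tolist())
```

    In the naive result the third entry comes back as the string 'missing', while the masked result keeps it as NaN.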
    
  • 2020-11-21 05:31

    There is a bit of ambiguity in your question. There are at least three interpretations:

    1. the keys in di refer to index values
    2. the keys in di refer to df['col1'] values
    3. the keys in di refer to index locations (not the OP's question, but thrown in for fun.)

    Below is a solution for each case.


    Case 1: If the keys of di are meant to refer to index values, then you could use the update method:

    df.update(pd.DataFrame({'col1': di}))
    

    For example,

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'col1':['w', 10, 20],
                       'col2': ['a', 30, np.nan]},
                      index=[1,2,0])
    #   col1 col2
    # 1    w    a
    # 2   10   30
    # 0   20  NaN
    
    di = {0: "A", 2: "B"}
    
    # The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
    df.update(pd.DataFrame({'col1': di}))
    print(df)
    

    yields

      col1 col2
    1    w    a
    2    B   30
    0    A  NaN
    

    I've modified the values from your original post so it is clearer what update is doing. Note how the keys in di are associated with index values. The order of the index values -- that is, the index locations -- does not matter.
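
    For comparison, a label-based assignment with .loc should give the same result as Case 1. This is only a sketch of an alternative, and unlike update it assumes every key in di is present in the index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
di = {0: "A", 2: "B"}

# Select rows by index label (not position) and assign the new values.
df.loc[list(di.keys()), 'col1'] = list(di.values())
print(df)
```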


    Case 2: If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:

    import pandas as pd
    import numpy as np
    
    df = pd.DataFrame({'col1':['w', 10, 20],
                       'col2': ['a', 30, np.nan]},
                      index=[1,2,0])
    print(df)
    #   col1 col2
    # 1    w    a
    # 2   10   30
    # 0   20  NaN
    
    di = {10: "A", 20: "B"}
    
    # The values 10 and 20 are replaced by 'A' and 'B'
    df['col1'] = df['col1'].replace(di)
    print(df)
    

    yields

      col1 col2
    1    w    a
    2    A   30
    0    B  NaN
    

    Note how in this case the keys in di were changed to match values in df['col1'].


    Case 3: If the keys in di refer to index locations, then you could use

    df.iloc[list(di.keys()), df.columns.get_loc('col1')] = list(di.values())
    

    since

    df = pd.DataFrame({'col1':['w', 10, 20],
                       'col2': ['a', 30, np.nan]},
                      index=[1,2,0])
    di = {0: "A", 2: "B"}
    
    # The values at the 0 and 2 index locations are replaced by 'A' and 'B'
    df.iloc[list(di.keys()), df.columns.get_loc('col1')] = list(di.values())
    print(df)
    

    yields

      col1 col2
    1    A    a
    2   10   30
    0    B  NaN
    

    Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python's 0-based indexing refer to the first and third locations.
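
    To see how locations relate to index values in Case 3, the positions can be translated into labels first. The sketch below assumes the index is unique; positions and labels are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
di = {0: "A", 2: "B"}

# Translate 0-based positions into the index labels stored at those positions.
positions = list(di.keys())
labels = df.index[positions]
print(labels.tolist())

# Assigning by label now touches the same rows as a position-based assignment.
df.loc[labels, 'col1'] = list(di.values())
print(df)
```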

  • 2020-11-21 05:40

    As an extension of what has been proposed by Nico Coallier (applying to multiple columns) and U10-Forward (using apply-style methods), and summarising it in a one-liner, I propose:

    df.loc[:,['col1','col2']].transform(lambda x: x.map(lambda x: {1: "A", 2: "B"}.get(x,x)))
    

    .transform() passes each column to the function as a Series.

    Consequently you can apply the Series method map().

    Finally, and I discovered this behaviour thanks to U10, map() calls the inner lambda once per element, so the .get() expression receives individual values rather than the whole Series.
    The .get(x, x) fallback returns the original value for anything missing from the mapping dictionary, which .map() would otherwise turn into NaN.
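
    A small runnable sketch of the one-liner, using made-up sample data and the illustrative names col and v for the two lambda arguments:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 5, 1]})
di = {1: "A", 2: "B"}

# Each column arrives as a Series; map() then looks up every element,
# falling back to the element itself when it is not a key of di.
out = df[['col1', 'col2']].transform(lambda col: col.map(lambda v: di.get(v, v)))
print(out)
```

    Values without a dictionary entry (3 and 5 here) pass through unchanged instead of becoming NaN.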

  • 2020-11-21 05:44

    You can use .replace. For example:

    >>> df = pd.DataFrame({'col1': {0: 'w', 1: 1, 2: 2}, 'col2': {0: 'a', 1: 2, 2: np.nan}})
    >>> di = {1: "A", 2: "B"}
    >>> df
      col1 col2
    0    w    a
    1    1    2
    2    2  NaN
    >>> df.replace({"col1": di})
      col1 col2
    0    w    a
    1    A    2
    2    B  NaN
    

    or directly on the Series, i.e. df["col1"] = df["col1"].replace(di).

  • 2020-11-21 05:45

    DSM has the accepted answer, but the code doesn't seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):

    import pandas as pd
    
    df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
                'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})
    
    conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
    df['converted_column'] = df['col2'].replace(conversion_dict)
    
    print(df.head())
    

    You'll see it looks like:

       col1      col2  converted_column
    0     1  negative                -1
    1     2  positive                 1
    2     2   neutral                 0
    3     3   neutral                 0
    4     1  positive                 1
    

    See the pandas.DataFrame.replace documentation for details.

  • 2020-11-21 05:47

    map can be much faster than replace

    If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):

    Exhaustive Mapping

    In this case, the form is very simple:

    df['col1'].map(di)       # note: if the dictionary does not exhaustively map all
                             # entries then non-matched entries are changed to NaNs
    

    Although map most commonly takes a function as its argument, it can alternatively take a dictionary or a Series; see the documentation for pandas.Series.map.

    Non-Exhaustive Mapping

    If you have a non-exhaustive mapping and wish to retain the existing values for non-matches, you can add fillna:

    df['col1'].map(di).fillna(df['col1'])
    

    as in @jpp's answer here: Replace values in a pandas series via dictionary efficiently
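
    The difference between the two forms can be checked on a tiny Series. This is a minimal sketch with made-up data; strict and lenient are illustrative names.

```python
import pandas as pd

di = {1: "A", 2: "B"}
s = pd.Series([1, 2, 3], name='col1')

# Plain map: the unmatched value 3 becomes NaN.
strict = s.map(di)

# map + fillna: the unmatched value 3 is restored from the original Series.
lenient = s.map(di).fillna(s)

print(strict.tolist())
print(lenient.tolist())
```

    Note that the second form only behaves as intended when none of the dictionary values is itself NaN.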

    Benchmarks

    Using the following data with pandas version 0.23.1:

    di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
    df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })
    

    and testing with %timeit, it appears that map is approximately 10x faster than replace.

    Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp's answer (linked above) for more extensive benchmarks and discussion.
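
    To reproduce the comparison outside IPython, the standard-library timeit module can stand in for %timeit. The sketch below uses a seeded random generator and five repetitions, both of which are arbitrary choices; the timings depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H"}
rng = np.random.default_rng(0)
df = pd.DataFrame({'col1': rng.integers(1, 9, size=100_000)})

# Both approaches should produce the same values for an exhaustive mapping.
mapped = df['col1'].map(di)
replaced = df['col1'].replace(di)

# Total time for five runs of each approach.
t_map = timeit.timeit(lambda: df['col1'].map(di), number=5)
t_replace = timeit.timeit(lambda: df['col1'].replace(di), number=5)
print(f"map: {t_map:.4f}s  replace: {t_replace:.4f}s")
```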
