Populating Pandas DataFrame column based on dictionary of regex

清歌不尽 2021-01-20 13:43

I have a dataframe like the following:

    GE    GO
1   AD    Weiss
2   KI    Ruby
3   OH    Port
4   ER    Rose
5   KI    Rose
6   JJ    Weiss
7   OH    7UP         


        
2 Answers
  • 2021-01-20 14:02

    One option is to use the re module with map on the GO column:

    import re
    df['OUT'] = df.GO.map(lambda x: next(Dic[k] for k in Dic if re.search(k, x)))
    df
    

    This raises StopIteration if none of the patterns matches a string, because next() has nothing to return. If some strings may not match any pattern, wrap the lookup in a function that catches the exception and returns None:

    import re
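    def findCat(x):
        # Return the category of the first pattern that matches x, or None.
        try:
            return next(Dic[k] for k in Dic if re.search(k, x))
        except StopIteration:
            # No pattern in Dic matched this string.
            return None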
    
    df['OUT'] = df.GO.map(findCat)
    df
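
The helper above can also be written without try/except by passing a default to next(). This is a minimal sketch, not the answerer's code: the question's Dic is not shown, so the mapping and the sample rows below (including the unmatched value "Milk") are hypothetical.

```python
import re
import pandas as pd

# Hypothetical mapping of regex pattern -> category (the original Dic is not shown).
Dic = {r"Weiss|Ruby|Stout": "Beer", r"Port|Rose": "Wine", r"7UP|Coke": "Soda"}

# Hypothetical sample data; "Milk" matches no pattern on purpose.
df = pd.DataFrame({"GE": ["AD", "KI", "OH", "ER"],
                   "GO": ["Weiss", "Port", "7UP", "Milk"]})

def find_cat(x):
    # The second argument to next() is returned when the generator is empty,
    # so no StopIteration is raised for unmatched strings.
    return next((cat for pat, cat in Dic.items() if re.search(pat, x)), None)

df["OUT"] = df["GO"].map(find_cat)
```

Rows whose GO value matches no pattern end up with a missing value in OUT rather than stopping the whole map call.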
    
  • 2021-01-20 14:06

    You can do it this way:

    In [253]: df['OUT'] = df[['GO']].replace({'GO':Dic}, regex=True)
    
    In [254]: df
    Out[254]:
        GE     GO   OUT
    1   AD  Weiss  Beer
    2   KI   Ruby  Beer
    3   OH   Port  Wine
    4   ER   Rose  Wine
    5   KI   Rose  Wine
    6   JJ  Weiss  Beer
    7   OH    7UP  Soda
    8   AD    7UP  Soda
    9   OP   Coke  Soda
    10  JJ  Stout  Beer
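
One behavior worth keeping in mind with replace(..., regex=True): it substitutes only the part of the string that matched, and it leaves strings that match no pattern unchanged. The sketch below illustrates this with a hypothetical Dic and hypothetical sample values (the question's own mapping is not shown); anchoring the patterns is one possible way to replace the whole string instead.

```python
import pandas as pd

# Hypothetical mapping of regex pattern -> category (the original Dic is not shown).
Dic = {r"Weiss|Ruby|Stout": "Beer", r"Port|Rose": "Wine", r"7UP|Coke": "Soda"}

# Hypothetical values: "Milk" matches nothing, "Weissbier" matches only partially.
s = pd.Series(["Weiss", "Rose", "Milk", "Weissbier"], name="GO")

# Only the matched substring is replaced; unmatched strings pass through as-is.
out = s.replace(Dic, regex=True)

# Wrapping each pattern in ^.* ... .*$ makes a match consume the whole string.
whole = {rf"^.*(?:{pat}).*$": cat for pat, cat in Dic.items()}
out_whole = s.replace(whole, regex=True)
```

Here out keeps "Milk" and turns "Weissbier" into "Beerbier", while out_whole maps "Weissbier" to "Beer". If unmatched rows should become missing values rather than keep their original text, the map-based approach from the other answer may be a better fit.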
    

    Interesting observation: in older Pandas versions, Series.map() was almost always faster than DataFrame.replace() and Series.str.replace(). That improved in Pandas 0.19.2, where replace() comes out ahead in the timings below:

    In [267]: df = pd.concat([df] * 10**4, ignore_index=True)
    
    In [268]: %timeit df.GO.map(lambda x: next(Dic[k] for k in Dic if re.search(k, x)))
    1 loop, best of 3: 1.57 s per loop
    
    In [269]: %timeit df[['GO']].replace({'GO':Dic}, regex=True)
    1 loop, best of 3: 895 ms per loop
    
    In [270]: %timeit df.GO.replace(Dic, regex=True)
    1 loop, best of 3: 876 ms per loop
    
    In [271]: df.shape
    Out[271]: (100000, 2)
    