applying regex to a pandas dataframe

我寻月下人不归 2020-12-05 13:39

I'm having trouble applying a regex function to a column in a pandas dataframe. Here is the head of my dataframe:

               Name   Season          School         


        
5 answers
  • 2020-12-05 13:51

    The problem can be solved with the following code:

    import re

    def split_it(year):
        # re.search returns a match object, or None when no 4-digit run is found
        x = re.search(r'\d{4}', year)
        if x:
            return x.group()

    df['Season2'] = df['Season'].apply(split_it)
    

    You were facing this problem because some rows didn't have a year in the string.
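
A minimal sketch of that failure mode, assuming a hypothetical `Season` column in which one row contains no year. The sample values are made up for illustration and are not taken from the question's data; `split_it` here is a variant built on `re.search`.

```python
import re

import pandas as pd

# Hypothetical sample data; the second row deliberately contains no year.
df = pd.DataFrame({"Season": ["1982-83", "unknown", "1983-84"]})


def split_it(season):
    # re.search returns a match object, or None when no 4-digit run is found.
    match = re.search(r"\d{4}", season)
    return match.group() if match else None


# Rows without a year end up as missing values instead of raising an error.
df["Season2"] = df["Season"].apply(split_it)
print(df)
```

Calling `.group()` without the `None` check would raise `AttributeError` on the row that has no year.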

  • 2020-12-05 14:04

    When I try (a variant of) your code I get NameError: name 'x' is not defined, which it isn't.

    You could use either

    df['Season2'] = df['Season'].apply(split_it)
    

    or

    df['Season2'] = df['Season'].apply(lambda x: split_it(x))
    

    but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here.) Your function will return a list, though:

    >>> df["Season"].apply(split_it)
    74     [1982]
    84     [1982]
    176    [1982]
    177    [1983]
    243    [1982]
    Name: Season, dtype: object
    

    although you could easily change that. FWIW, I'd use vectorized string operations and do something like

    >>> df["Season"].str[:4].astype(int)
    74     1982
    84     1982
    176    1982
    177    1983
    243    1982
    Name: Season, dtype: int64
    

    or

    >>> df["Season"].str.split("-").str[0].astype(int)
    74     1982
    84     1982
    176    1982
    177    1983
    243    1982
    Name: Season, dtype: int64
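
As a rough, self-contained sketch of the two vectorized approaches above, here they are run on a small hypothetical frame. The sample values are assumptions in the "YYYY-YY" form discussed in this thread, not the question's actual data.

```python
import pandas as pd

# Hypothetical sample data in the "YYYY-YY" form used in this thread.
df = pd.DataFrame({"Season": ["1982-83", "1982-83", "1983-84"]})

# Take the first four characters of every string, then convert to integers.
by_slice = df["Season"].str[:4].astype(int)

# Split on the hyphen, keep the first piece, then convert to integers.
by_split = df["Season"].str.split("-").str[0].astype(int)

print(by_slice.tolist())
print(by_split.tolist())
```

Both forms assume every row starts with a four-digit year; `astype(int)` raises on a row where the extracted piece is not numeric.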
    
  • 2020-12-05 14:04

    You can simply use str.extract

    df['Season2']=df['Season'].str.extract(r'(\d{4})-\d{2}')
    

    Here you locate \d{4}-\d{2} (for example 1982-83) but extract only the captured group in parentheses, \d{4} (for example 1982).
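
A small sketch of how `str.extract` behaves on rows that do not match, assuming hypothetical sample data. The `expand=False` argument and the `pd.to_numeric` step are additions for illustration, not part of the answer above.

```python
import pandas as pd

# Hypothetical sample data; the last row has no "YYYY-YY" pattern.
df = pd.DataFrame({"Season": ["1982-83", "1983-84", "n/a"]})

# expand=False returns a Series (rather than a one-column DataFrame)
# when the pattern has a single capture group.
years = df["Season"].str.extract(r"(\d{4})-\d{2}", expand=False)

# Rows that do not match become missing values; to_numeric converts the
# matched strings to numbers and keeps the gaps as missing.
df["Season2"] = pd.to_numeric(years)
print(df)
```

Unlike the `apply` version, no explicit check for a failed match is needed, because unmatched rows are filled with missing values automatically.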

  • 2020-12-05 14:07

    I had the exact same issue. Thanks for the answers @DSM. FYI @itjcms, you can improve the function by removing the repetition of the '\d\d\d\d'.

    def split_it(year):
        return re.findall(r'(\d\d\d\d)', year)
    

    Becomes:

    def split_it(year):
        # raw string so the backslash reaches the regex engine unchanged
        return re.findall(r'(\d{4})', year)
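
A quick check that the two spellings are interchangeable, using a hypothetical sample string in the "YYYY-YY" form from this thread:

```python
import re

# Hypothetical sample string in the "YYYY-YY" form used in this thread.
season = "1982-83"

# \d{4} is shorthand for four consecutive \d tokens, so both patterns
# find the same matches. Raw strings keep the backslashes intact.
long_form = re.findall(r"(\d\d\d\d)", season)
short_form = re.findall(r"(\d{4})", season)

print(long_form == short_form)
```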
    
  • 2020-12-05 14:08

    You can also do this with a native pandas function.

    The pandas documentation lists the string methods that accept regular expressions. For your case, you can do

    df["Season"].str.extract(r'(\d{4})')
    