I'm having trouble applying a regex function to a column in a Python dataframe. Here is the head of my dataframe:
Name Season School
The problem can be solved with the following code:
import re
def split_it(year):
    # re.search returns a match object, or None if there is no four-digit run
    x = re.search(r'\d{4}', year)
    if x:
        return x.group()
df['Season2'] = df['Season'].apply(split_it)
You were hitting this problem because some rows didn't have a year in the string.
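To see why the choice between re.search and re.findall matters here, a minimal sketch in plain Python may help. The input "unknown" is a made-up stand-in for a row without a year, not a value from the original dataframe.

```python
import re

def split_it(year):
    # re.search returns a match object, or None when nothing matches,
    # so the None case has to be checked before calling .group().
    match = re.search(r"\d{4}", year)
    return match.group() if match else None

with_year = split_it("1982-83")
without_year = split_it("unknown")  # hypothetical row with no year

# re.findall returns a plain list of strings, which has no .group() method.
found = re.findall(r"\d{4}", "1982-83")

print(with_year)     # 1982
print(without_year)  # None
print(found)         # ['1982']
```

Rows without a year come back as None, which pandas will treat as a missing value in the new column.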
When I try (a variant of) your code I get NameError: name 'x' is not defined
-- which it isn't.
You could use either
df['Season2'] = df['Season'].apply(split_it)
or
df['Season2'] = df['Season'].apply(lambda x: split_it(x))
but the second one is just a longer and slower way to write the first one, so there's not much point (unless you have other arguments to handle, which we don't here.) Your function will return a list, though:
>>> df["Season"].apply(split_it)
74 [1982]
84 [1982]
176 [1982]
177 [1983]
243 [1982]
Name: Season, dtype: object
although you could easily change that. FWIW, I'd use vectorized string operations and do something like
>>> df["Season"].str[:4].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
or
>>> df["Season"].str.split("-").str[0].astype(int)
74 1982
84 1982
176 1982
177 1983
243 1982
Name: Season, dtype: int64
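One caveat with the two vectorized versions above: astype(int) will raise if any row does not start with a year. A hedged sketch of one way around that, assuming a small made-up Series (the "unknown" entry is hypothetical) and swapping in pd.to_numeric with errors="coerce" instead of astype(int):

```python
import pandas as pd

# Hypothetical sample data; the last entry has no year in it.
season = pd.Series(["1982-83", "1983-84", "unknown"], name="Season")

# Slice the first four characters of every string in one vectorized step.
first_four = season.str[:4]

# errors="coerce" turns anything that is not a number into NaN
# instead of raising, so the result is a float column.
years = pd.to_numeric(first_four, errors="coerce")

print(years)
```

The trade-off is that the result is float rather than int, because NaN cannot be stored in a plain int64 column.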
You can simply use str.extract
df['Season2']=df['Season'].str.extract(r'(\d{4})-\d{2}')
This matches \d{4}-\d{2} (for example 1982-83) but extracts only the captured group in parentheses, \d{4} (for example 1982).
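Here is a small runnable sketch of that str.extract call, assuming a made-up dataframe (the values, including "unknown", are hypothetical). It passes expand=False so that a single capture group gives back a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical sample data; the last row does not match the pattern.
df = pd.DataFrame({"Season": ["1982-83", "1983-84", "unknown"]})

# Match the whole "YYYY-YY" shape but keep only the captured four digits.
# expand=False returns a Series when the pattern has one capture group.
df["Season2"] = df["Season"].str.extract(r"(\d{4})-\d{2}", expand=False)

print(df)
```

Rows that do not match the pattern end up as missing values in Season2, so no explicit None check is needed.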
I had the exact same issue. Thanks for the answers @DSM.
FYI @itjcms, you can improve the function by removing the repetition of the '\d\d\d\d'.
def split_it(year):
    return re.findall(r'(\d\d\d\d)', year)
Becomes:
def split_it(year):
    return re.findall(r'(\d{4})', year)
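As a quick check that the two patterns behave the same, here is a minimal sketch on one sample string. It also uses raw strings (the r prefix), since recent Python versions warn about '\d' in a normal string literal.

```python
import re

text = "1982-83"

# Four explicit \d tokens versus a {4} repetition count.
repeated = re.findall(r"(\d\d\d\d)", text)
counted = re.findall(r"(\d{4})", text)

print(repeated)  # ['1982']
print(counted)   # ['1982']
```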
You can use pandas' native string functions to do this too.
Check this page for the pandas functions that accept regular expressions. For your case, you can do
df["Season"].str.extract(r'(\d{4})')