I'm looking to split a string Series at different points depending on the length of certain substrings:
In [47]: df = pd.DataFrame(['group9class1', 'grou
You can also use zip together with a list comprehension:
df['group'], df['class'] = zip(
    *[(string[:n], string[n:])
      for string, n in zip(df.group_class, split_locations)])
>>> df
group_class group class
0 group9class1 group9 class1
1 group10class2 group10 class2
2 group11class20 group11 class20
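For anyone who wants to run this end to end, here is a minimal self-contained sketch. The question's setup code is cut off above, so the sample frame is reconstructed from the output shown, and the values in split_locations are my assumption (the length of each "group..." prefix), not the asker's actual data.

```python
import pandas as pd

# Sample data reconstructed from the output above (assumption).
df = pd.DataFrame(
    {"group_class": ["group9class1", "group10class2", "group11class20"]}
)

# Hypothetical split positions: the length of each "group<N>" prefix.
split_locations = [6, 7, 7]

# Slice every string at its own position, giving (group, class) pairs.
pairs = [(s[:n], s[n:]) for s, n in zip(df["group_class"], split_locations)]

# zip(*pairs) transposes the list of pairs into two tuples, one per column.
groups, classes = zip(*pairs)
df["group"] = list(groups)
df["class"] = list(classes)

print(df)
```

The zip(*pairs) step is just a transpose: it turns a list of (group, class) pairs into one sequence of groups and one of classes.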
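This works: selecting with double brackets ([[]]) gives you a DataFrame rather than a Series, so apply with axis=1 passes each row as a Series whose name attribute is that row's index value. You can use that value to index into the split_locations Series: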
In [119]:
df[['group_class']].apply(lambda x: pd.Series([x.str[split_locations[x.name]:][0], x.str[:split_locations[x.name]][0]]), axis=1)
Out[119]:
0 1
0 class1 group9
1 class2 group10
2 class20 group11
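The same idea can be written a bit more explicitly with a named function. This is only a sketch: split_row is a hypothetical helper name, and the sample data and split_locations values are assumed, since the question's setup is truncated.

```python
import pandas as pd

# Sample data reconstructed from the outputs in this thread (assumption).
df = pd.DataFrame(
    {"group_class": ["group9class1", "group10class2", "group11class20"]}
)

# Hypothetical split positions, sharing the same index as df.
split_locations = pd.Series([6, 7, 7], index=df.index)


def split_row(row):
    """Split one row's string at the position stored for that row's index label."""
    n = split_locations.loc[row.name]  # row.name is the row's index label
    s = row["group_class"]
    return pd.Series({"group": s[:n], "class": s[n:]})


# Double brackets keep a DataFrame, so axis=1 hands each row to split_row.
result = df[["group_class"]].apply(split_row, axis=1)
print(result)
```

Returning a Series with named entries gives labelled columns instead of 0 and 1. Row-wise apply is slow on large frames, so treat this as an illustration rather than the fastest option.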
Or, as @ajcr has suggested, you can use str.extract:
In [106]:
df['group_class'].str.extract(r'(?P<group>group[0-9]+)(?P<class>class[0-9]+)')
Out[106]:
group class
0 group9 class1
1 group10 class2
2 group11 class20
EDIT
Regex explanation:
The regex came from @ajcr (thanks!). It uses str.extract to extract groups, and each group becomes a new column.
?P<group> gives the capture group a name, which is used as the column name. If it is omitted, the column is labelled with an integer instead.
The rest should be self-explanatory: group[0-9]+ looks for the string group followed by one or more digits. The [] denote a character class, here the range 0-9, and the + means "one or more". For this data it is equivalent to group\d+, where \d means digit.
So it could be rewritten as:
df['group_class'].str.extract(r'(?P<group>group\d+)(?P<class>class\d+)')
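To see what the group names actually buy you, here is a small sketch comparing named and unnamed capture groups. The sample Series is assumed from the outputs shown earlier in the thread.

```python
import pandas as pd

# Sample data reconstructed from the outputs in this thread (assumption).
s = pd.Series(["group9class1", "group10class2", "group11class20"])

# Named groups: the names become the column labels.
named = s.str.extract(r"(?P<group>group\d+)(?P<class>class\d+)")

# Unnamed groups: the columns are labelled with integers.
unnamed = s.str.extract(r"(group\d+)(class\d+)")

print(list(named.columns))    # ['group', 'class']
print(list(unnamed.columns))  # [0, 1]
```

Both calls extract the same values; only the column labels differ.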
Use a regular expression to split the string:
import re
regex = re.compile("(class)")
s = "group1class23"  # named s rather than str, to avoid shadowing the built-in
# Insert a space in front of "class", then split on that space
# to separate the group part from the class part.
split_string = re.sub(regex, " \\1", s).split(" ")
This will return the list:
['group1', 'class23']
So to append two new columns to your DataFrame, you can do:
new_cols = [re.sub(regex, " \\1", x).split(" ") for x in df.group_class]
df['group'], df['class'] = zip(*new_cols)
Which results in:
group_class group class
0 group9class1 group9 class1
1 group10class2 group10 class2
2 group11class20 group11 class20