I currently have this df where the rect column is all strings. I need to extract the x, y, w and h from it into separate columns. The dataset is very large so I need an effi
Use str.extract, which extracts groups from regex into columns:
df['rect'].str.extract(r'\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)', expand=True)
Result:
x y w h
0 120 168 260 120
1 120 168 260 120
2 120 168 260 120
3 120 168 260 120
4 120 168 260 120
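A minimal end-to-end sketch of the named-group approach, assuming every row looks like the sample in the question (the names `pattern`, `parts`, and `result` are illustrative only). Note that str.extract returns strings, so the sketch converts them to integers before joining them back:

```python
import pandas as pd

# Sample data, assumed to match the format shown in the question.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>"] * 5})

# Named groups become the column names of the extracted frame.
pattern = r"\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)"

# str.extract yields strings; astype(int) turns them into integers.
# This raises if any row fails to match, because such rows come back as NaN.
parts = df["rect"].str.extract(pattern, expand=True).astype(int)

# join aligns on the index, so the new columns line up with the original rows.
result = df.join(parts)
print(result.dtypes)
```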
Using extractall
df[['x', 'y', 'w', 'h']] = df['rect'].str.extractall(r'(\d+)').unstack().loc[:,0]
The right-hand side evaluates to:
match 0 1 2 3
0 120 168 260 120
1 120 168 260 120
2 120 168 260 120
3 120 168 260 120
4 120 168 260 120
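To see what extractall does before the reshaping, here is a small sketch that breaks the one-liner into steps. It assumes each string contains exactly four numbers; the names `matches` and `wide` are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>"] * 3})

# One row per match, indexed by (original row label, match number).
matches = df["rect"].str.extractall(r"(\d+)")

# Pivot the match number into columns: one row per original row, columns 0..3.
wide = matches[0].unstack()

# Name the columns and convert the extracted strings to integers.
wide.columns = list("xywh")
df = df.join(wide.astype(int))
print(df)
```

If some rows held more or fewer than four numbers, the unstacked frame would have a different number of columns (or NaN gaps), so the stricter str.extract pattern is the safer choice when the format is known.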
Produce a copy
nums = df.rect.str.findall(r'\d+')
# .str[i] selects the i-th number from each list of matches
df.assign(**{c: nums.str[i] for i, c in enumerate('xywh')})
rect x y w h
0 <Rect (120,168),260 by 120> 120 168 260 120
1 <Rect (120,168),260 by 120> 120 168 260 120
2 <Rect (120,168),260 by 120> 120 168 260 120
3 <Rect (120,168),260 by 120> 120 168 260 120
4 <Rect (120,168),260 by 120> 120 168 260 120
Or just reassign to df
nums = df.rect.str.findall(r'\d+')
df = df.assign(**{c: nums.str[i] for i, c in enumerate('xywh')})
df
rect x y w h
0 <Rect (120,168),260 by 120> 120 168 260 120
1 <Rect (120,168),260 by 120> 120 168 260 120
2 <Rect (120,168),260 by 120> 120 168 260 120
3 <Rect (120,168),260 by 120> 120 168 260 120
4 <Rect (120,168),260 by 120> 120 168 260 120
Modify existing df
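# index=df.index keeps the rows aligned when df does not have a default RangeIndex
df[[*'xywh']] = pd.DataFrame(df.rect.str.findall(r'\d+').tolist(), index=df.index)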
df
rect x y w h
0 <Rect (120,168),260 by 120> 120 168 260 120
1 <Rect (120,168),260 by 120> 120 168 260 120
2 <Rect (120,168),260 by 120> 120 168 260 120
3 <Rect (120,168),260 by 120> 120 168 260 120
4 <Rect (120,168),260 by 120> 120 168 260 120
If the strings follow a specific format <Rect \((\d+),(\d+)\),(\d+) by (\d+)>, you can use this regular expression with the str.extract method:
df[['x','y','w','h']] = df.rect.str.extract(r'<Rect \((\d+),(\d+)\),(\d+) by (\d+)>')
df
# rect x y w h
#0 <Rect (120,168),260 by 120> 120 168 260 120
#1 <Rect (120,168),260 by 120> 120 168 260 120
#2 <Rect (120,168),260 by 120> 120 168 260 120
#3 <Rect (120,168),260 by 120> 120 168 260 120
#4 <Rect (120,168),260 by 120> 120 168 260 120
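With a strict pattern like this, rows that do not match come back as NaN rather than raising. A hedged sketch of handling that on messy data follows; the "garbage" and missing rows are made-up examples, and `parts` and `bad_rows` are illustrative names:

```python
import pandas as pd

# Two of these rows deliberately do not match the expected format.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>", "garbage", None]})

pattern = r"<Rect \((\d+),(\d+)\),(\d+) by (\d+)>"
parts = df["rect"].str.extract(pattern)
parts.columns = list("xywh")

# Rows where the pattern did not match are all-NaN.
bad_rows = parts["x"].isna()

# Convert to numbers, then to the nullable Int64 dtype so that
# missing values can coexist with integers.
parts = parts.apply(pd.to_numeric).astype("Int64")

df = df.join(parts)
print(df[bad_rows])
```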
This is one of those cases where it makes sense to "optimize" the data itself instead of trying to morph it into what a consumer wants. It's much easier to change clean data into a specialized format than it is to change a specialized format into something portable.
That said, if you really have to parse this, you can do something like this:
>>> import re
>>> re.findall(r'\d+', '<Rect (120,168),260 by 120>')
['120', '168', '260', '120']
>>>
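Applied to a whole column, the same idea might look like the sketch below. It assumes each string contains exactly four numbers; `NUMBERS` and `parse_rect` are hypothetical helper names, not part of any library. Whether this beats the pandas string methods on a large dataset would have to be measured on the real data.

```python
import re

import pandas as pd

# Compile once so the pattern is not re-parsed for every row.
NUMBERS = re.compile(r"\d+")


def parse_rect(text):
    """Return (x, y, w, h) as ints; raises ValueError unless exactly four numbers are found."""
    x, y, w, h = map(int, NUMBERS.findall(text))
    return x, y, w, h


df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>"] * 3})

# Build the new columns in one pass and keep them aligned with df's index.
parsed = pd.DataFrame(
    [parse_rect(s) for s in df["rect"]],
    columns=list("xywh"),
    index=df.index,
)
df = df.join(parsed)
print(df)
```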