Remove opening and closing parenthesis with word in pandas

前端未结


Given a data frame:

df = 

                         multi
0 MULTIPOLYGON(((3 11, 2 33)))
1 MULTIPOLYGON(((4 22, 5 66)))

I was trying to remov


4 answers

名媛妹妹

2021-01-26 01:31
apply is a rather slow method in pandas, since it is basically a loop that iterates over each row and applies your function. pandas has vectorized string methods; here we can use str.extract to pull out your pattern:
```
# the pattern expects a single leading digit and decimal values, e.g. "3.49 11.10"
df['multi'] = df['multi'].str.extract(r'(\d\.\d+\s\d+\.\d+)')

        multi
0  3.49 11.10
1  4.49 22.12
```
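The decimal pattern above would not match the integer sample shown in the question. A minimal sketch for that case, assuming the goal is simply to keep whatever sits between the parentheses; the pattern and the `inner` name are illustrative and not part of the original answer:

```python
import pandas as pd

# sample rows as shown in the question
df = pd.DataFrame({'multi': ['MULTIPOLYGON(((3 11, 2 33)))',
                             'MULTIPOLYGON(((4 22, 5 66)))']})

# capture everything between the run of "(" and the run of ")";
# expand=False returns a Series instead of a one-column DataFrame
inner = df['multi'].str.extract(r'\(+([^()]*)\)+', expand=False)
print(inner.tolist())
```

Rows that contain no parentheses at all would come back as NaN, so it may be worth checking for missing values before overwriting the column.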

终归单人心

2021-01-26 01:33

You can also use str.replace with a regex:

```
# removes anything that is not a digit, a space or a dot (commas are removed too)
df['multi'] = df.multi.str.replace(r'[^0-9. ]', '', regex=True)
```
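
Note that the character class above also strips the comma between coordinate pairs. A sketch of a variant that keeps it, assuming the pairs should stay distinguishable; the sample frame is rebuilt here so the snippet runs on its own, and `flat` and `pairs` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'multi': ['MULTIPOLYGON(((3 11, 2 33)))',
                             'MULTIPOLYGON(((4 22, 5 66)))']})

# drop everything except digits, dots and spaces (the comma goes as well)
flat = df['multi'].str.replace(r'[^0-9. ]', '', regex=True)

# same idea, but the comma is added to the set of kept characters
pairs = df['multi'].str.replace(r'[^0-9., ]', '', regex=True)

print(flat.tolist())
print(pairs.tolist())
```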


一生所求

2021-01-26 01:43

Try this:

```
import re

import pandas as pd


def f(x):
    # keep only the runs of digits, commas and spaces
    return ' '.join(re.findall(r'[0-9, ]+', x))


def f2(x):
    # take the first such run and split it on commas, one entry per pair
    x = re.findall(r'[0-9, ]+', x)
    return pd.Series(x[0].split(','))


df = pd.DataFrame({'a': ['MULTIPOLYGON(((3 11, 2 33)))', 'MULTIPOLYGON(((4 22, 5 6)))']})
df['a'] = df['a'].apply(f)
print(df)

# or, to split the pairs into separate columns:
df = pd.DataFrame({'a': ['MULTIPOLYGON(((3 11, 2 33)))', 'MULTIPOLYGON(((4 22, 5 6)))']})
list_of_cols = ['c1', 'c2']
df[list_of_cols] = df['a'].apply(f2)
del df['a']
print(df)
```

Output:

```
            a
0  3 11, 2 33
1   4 22, 5 6
     c1     c2
0  3 11   2 33
1  4 22    5 6
```
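
The same split can also be done without apply. A minimal sketch, assuming every row holds exactly two comma-separated pairs; `c1` and `c2` are the column names used above, while `cleaned`, `parts` and the strip step (which drops the space left after the comma) are additions for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': ['MULTIPOLYGON(((3 11, 2 33)))',
                         'MULTIPOLYGON(((4 22, 5 6)))']})

# keep digits, commas and spaces, then split on the comma into two columns
cleaned = df['a'].str.replace(r'[^0-9, ]', '', regex=True)
parts = cleaned.str.split(',', expand=True)

# strip the whitespace left around each piece
df[['c1', 'c2']] = parts.apply(lambda col: col.str.strip())
df = df.drop(columns='a')
print(df)
```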


野性不改

2021-01-26 01:50

You can use df.column.str in the following way.

```
df['a'] = df['a'].str.findall(r'[0-9.]+')
df = pd.DataFrame(df['a'].tolist())
print(df)
```

Output:

```
      0      1
0  3.49  11.10
1  4.49  22.12
```

This works for any number of columns, but at the end you have to name them:

```
df.columns = ['a'+str(i) for i in range(df.shape[1])]
```

This method also works when rows contain different numbers of numeric values, for example:

```
df = pd.DataFrame({'a': ['MULTIPOLYGON(((3.49)))', 'MULTIPOLYGON(((4.49 22.12)))']})

                              a
0        MULTIPOLYGON(((3.49)))
1  MULTIPOLYGON(((4.49 22.12)))
```

So the expected output is:

```
      0      1
0  3.49   None
1  4.49  22.12
```

After naming the columns with

```
df.columns = ['a'+str(i) for i in range(df.shape[1])]
```

you get:

```
     a0     a1
0  3.49   None
1  4.49  22.12
```
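
Putting these steps together as a runnable sketch. The `pd.to_numeric` conversion at the end is an addition, on the assumption that numeric columns are wanted; missing entries then show up as NaN rather than None. `parts` and `wide` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'a': ['MULTIPOLYGON(((3.49)))',
                         'MULTIPOLYGON(((4.49 22.12)))']})

# one list of number strings per row; lists may differ in length
parts = df['a'].str.findall(r'[0-9.]+')

# shorter rows are padded with None when the lists become columns
wide = pd.DataFrame(parts.tolist(), index=df.index)
wide.columns = ['a' + str(i) for i in range(wide.shape[1])]

# convert each column from strings to floats (None becomes NaN)
wide = wide.apply(pd.to_numeric)
print(wide)
```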
