Remove opening and closing parentheses with word in pandas

庸人自扰 2021-01-26 01:07

Given a data frame:

df = 

                         multi
0 MULTIPOLYGON(((3 11, 2 33)))
1 MULTIPOLYGON(((4 22, 5 66)))

I was trying to remov

4 Answers
  • 2021-01-26 01:31

    apply is a rather slow method in pandas, since it is basically a loop that iterates over each row and applies your function. pandas has vectorized string methods; here we can use str.extract to pull out your pattern:

    # raw string avoids invalid-escape warnings; expand=False returns a Series
    df['multi'] = df['multi'].str.extract(r'(\d\.\d+\s\d+\.\d+)', expand=False)
    
            multi
    0  3.49 11.10
    1  4.49 22.12
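
As a rough, self-contained sketch of the str.extract approach in this answer: the sample frame below is hypothetical, with decimal coordinates chosen to match the output shown above (the frame in the question holds integers, which this pattern would not match). The pattern is also loosened to `\d+` so that multi-digit integer parts are accepted, which is an assumption and not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data with decimal coordinates (assumed, not from the question).
df = pd.DataFrame({"multi": ["MULTIPOLYGON(((3.49 11.10)))",
                             "MULTIPOLYGON(((4.49 22.12)))"]})

# One capture group; expand=False returns a Series instead of a one-column DataFrame.
pattern = r"(\d+\.\d+\s\d+\.\d+)"
extracted = df["multi"].str.extract(pattern, expand=False)
print(extracted)

# Rows that do not match the pattern come back as NaN rather than raising an error.
no_match = pd.Series(["MULTIPOLYGON(((3 11, 2 33)))"]).str.extract(pattern, expand=False)
print(no_match)
```

Checking for NaN after the extraction is a cheap way to spot rows whose format differs from what the pattern expects.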
    
  • 2021-01-26 01:33

    You can also use str.replace with a regex:

    # removes anything that's not a digit or a space or a dot
    df['multi'] = df['multi'].str.replace(r'[^0-9. ]', '', regex=True)
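
To make the effect of the character class concrete, here is a minimal sketch run against the data frame from the question. Note that the comma separating the coordinate pairs is also removed, because it is not in the allowed set; the second variant, which keeps the comma, is an illustrative assumption and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"multi": ["MULTIPOLYGON(((3 11, 2 33)))",
                             "MULTIPOLYGON(((4 22, 5 66)))"]})

# Delete every character that is not a digit, a dot or a space.
digits_only = df["multi"].str.replace(r"[^0-9. ]", "", regex=True)
print(digits_only.tolist())

# Variant (assumption): add the comma to the allowed set to keep the pair separator.
with_commas = df["multi"].str.replace(r"[^0-9., ]", "", regex=True)
print(with_commas.tolist())
```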
    
  • 2021-01-26 01:43

    Try this:

    import re

    import pandas as pd

    def f(x):
        # keep only the runs of digits, commas and spaces
        return ' '.join(re.findall(r'[0-9, ]+', x))

    def f2(x):
        # take the first numeric run and split it on the comma, one piece per column
        x = re.findall(r'[0-9, ]+', x)
        return pd.Series(x[0].split(','))

    df = pd.DataFrame({'a': ['MULTIPOLYGON(((3 11, 2 33)))', 'MULTIPOLYGON(((4 22, 5 6)))']})
    df['a'] = df['a'].apply(f)
    print(df)

    # or, to split the pairs into separate columns:
    df = pd.DataFrame({'a': ['MULTIPOLYGON(((3 11, 2 33)))', 'MULTIPOLYGON(((4 22, 5 6)))']})
    # df['multi'] = df.a.str.replace(r'[^0-9. ]', '', regex=True)
    # print(df)
    list_of_cols = ['c1', 'c2']
    df[list_of_cols] = df['a'].apply(f2)
    del df['a']
    print(df)
    

    output:

                a
    0  3 11, 2 33
    1   4 22, 5 6
         c1     c2
    0  3 11   2 33
    1  4 22    5 6
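
For comparison, the same two-column split can probably be done without apply, using only vectorized string methods. This is a sketch of an alternative, not the answerer's code; the names `cleaned` and `parts` are made up for illustration, and the extra strip step removes the leading space that the apply version leaves in c2.

```python
import pandas as pd

df = pd.DataFrame({"a": ["MULTIPOLYGON(((3 11, 2 33)))",
                         "MULTIPOLYGON(((4 22, 5 6)))"]})

# Keep only digits, commas and spaces, e.g. "3 11, 2 33".
cleaned = df["a"].str.replace(r"[^0-9, ]", "", regex=True)

# Split on the comma into separate columns, then trim stray whitespace in each column.
parts = cleaned.str.split(",", expand=True).apply(lambda col: col.str.strip())
parts.columns = ["c1", "c2"]
print(parts)
```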
    
  • 2021-01-26 01:50

    You can use df.column.str in the following way.

    df['a'] = df['a'].str.findall(r'[0-9.]+')
    df = pd.DataFrame(df['a'].tolist())
    print(df)
    

    output:

         0     1
    0  3.49  11.10
    1  4.49  22.12
    

    This works for any number of columns, but afterwards you have to name the columns yourself:

    df.columns = ['a'+str(i) for i in range(df.shape[1])]
    

    This method also works when rows contain different numbers of numerical values, for example:

    df =pd.DataFrame({'a':['MULTIPOLYGON(((3.49)))' ,'MULTIPOLYGON(((4.49 22.12)))']})
    
                                  a
    0        MULTIPOLYGON(((3.49)))
    1  MULTIPOLYGON(((4.49 22.12)))
    

    So the expected output is

          0     1
    0   3.49    None
    1   4.49    22.12
    

    After naming the columns using,

    df.columns = ['a'+str(i) for i in range(df.shape[1])]
    

    You get,

          a0    a1
    0   3.49    None
    1   4.49    22.12
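
The ragged-row case from this answer can be put together end to end roughly as follows. The final pd.to_numeric step is an addition of this sketch, not something the answer does: findall returns strings, so converting is needed if numeric columns are wanted.

```python
import pandas as pd

df = pd.DataFrame({"a": ["MULTIPOLYGON(((3.49)))",
                         "MULTIPOLYGON(((4.49 22.12)))"]})

# Each row becomes a list of numeric tokens; the lists may differ in length.
tokens = df["a"].str.findall(r"[0-9.]+")

# Building a DataFrame from the lists pads the shorter rows with missing values.
wide = pd.DataFrame(tokens.tolist(), index=df.index)
wide.columns = ["a" + str(i) for i in range(wide.shape[1])]
print(wide)

# Assumed extra step: convert the string tokens to floats (missing values become NaN).
numeric = wide.apply(pd.to_numeric)
print(numeric)
```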
    