Pandas extract numbers from column into new columns

谎友^ 2020-12-21 05:51

I currently have this df where the rect column is all strings. I need to extract the x, y, w and h from it into separate columns. The dataset is very large so I need an effi
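
For anyone who wants to reproduce the answers below, here is a minimal sketch of the input. The sample values are taken from the outputs shown in the answers; the five-row size and the construction itself are assumptions, not the asker's real data.

```python
import pandas as pd

# Hypothetical sample data: five identical rows, matching the outputs in the answers.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>"] * 5})

print(df)
```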

5 answers
  • 2020-12-21 05:56

    Use str.extract, which extracts the capture groups of a regex into columns:

    df['rect'].str.extract(r'\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)', expand=True)
    

    Result:

         x    y    w    h
    0  120  168  260  120
    1  120  168  260  120
    2  120  168  260  120
    3  120  168  260  120
    4  120  168  260  120
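
The extracted values are strings, so a cast is probably needed before doing arithmetic on them. A minimal sketch, assuming every row matches the pattern (the sample frame and the `parts` name are only for illustration):

```python
import pandas as pd

# Hypothetical sample data matching the output above.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>"] * 5})

pattern = r"\((?P<x>\d+),(?P<y>\d+)\),(?P<w>\d+) by (?P<h>\d+)"
parts = df["rect"].str.extract(pattern, expand=True)

# The captured groups are strings; cast them before doing arithmetic.
parts = parts.astype(int)

# join aligns on the index, so each row keeps its own numbers.
df = df.join(parts)
```

If some rows might not match, they come back as NaN and the cast would fail, so those rows would have to be handled first.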
    
  • Using extractall

    df[['x', 'y', 'w', 'h']] = df['rect'].str.extractall(r'(\d+)').unstack().loc[:, 0]

    The right-hand side on its own gives:

    match    0    1    2    3
    0      120  168  260  120
    1      120  168  260  120
    2      120  168  260  120
    3      120  168  260  120
    4      120  168  260  120
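
To make the intermediate steps visible, the one-liner can be unrolled roughly like this. This is a sketch on a two-row sample frame, and it assumes every row contains exactly four numbers; the names `matches` and `wide` are illustrative.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>"] * 2})

# One row per regex match, indexed by (original row, match number).
matches = df["rect"].str.extractall(r"(\d+)")

# Column 0 holds the single capture group; unstack pivots the
# match number into columns, giving one row per original row.
wide = matches[0].unstack()
wide.columns = ["x", "y", "w", "h"]
```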
    
  • 2020-12-21 06:07

    Inline

    Produce a copy

    df.assign(**dict(zip('xywh', df.rect.str.findall(r'\d+').str)))
    
                              rect    x    y    w    h
    0  <Rect (120,168),260 by 120>  120  168  260  120
    1  <Rect (120,168),260 by 120>  120  168  260  120
    2  <Rect (120,168),260 by 120>  120  168  260  120
    3  <Rect (120,168),260 by 120>  120  168  260  120
    4  <Rect (120,168),260 by 120>  120  168  260  120
    

    Or just reassign to df

    df = df.assign(**dict(zip('xywh', df.rect.str.findall(r'\d+').str)))
    
    df
    
                              rect    x    y    w    h
    0  <Rect (120,168),260 by 120>  120  168  260  120
    1  <Rect (120,168),260 by 120>  120  168  260  120
    2  <Rect (120,168),260 by 120>  120  168  260  120
    3  <Rect (120,168),260 by 120>  120  168  260  120
    4  <Rect (120,168),260 by 120>  120  168  260  120
    

    In place

    Modify existing df

    df[[*'xywh']] = pd.DataFrame(df.rect.str.findall(r'\d+').tolist(), index=df.index)
    
    df
    
                              rect    x    y    w    h
    0  <Rect (120,168),260 by 120>  120  168  260  120
    1  <Rect (120,168),260 by 120>  120  168  260  120
    2  <Rect (120,168),260 by 120>  120  168  260  120
    3  <Rect (120,168),260 by 120>  120  168  260  120
    4  <Rect (120,168),260 by 120>  120  168  260  120
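
The two assign variants rely on iterating over the .str accessor. As far as I can tell, that was deprecated in the pandas 1.x series and raises a TypeError from 2.0 on, so treat that as something to check against your version. A sketch of a copy-producing alternative that does not depend on it (sample frame and the `nums`/`result` names are assumptions):

```python
import pandas as pd

# Hypothetical sample data matching the output above.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>"] * 5})

# Build the four columns as a DataFrame that shares df's index,
# then attach them; join returns a new frame and leaves df untouched.
nums = pd.DataFrame(
    df["rect"].str.findall(r"\d+").tolist(),
    index=df.index,
    columns=list("xywh"),
)
result = df.join(nums)
```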
    
  • 2020-12-21 06:11

    If the strings follow the specific format <Rect \((\d+),(\d+)\),(\d+) by (\d+)>, you can use this regular expression with the str.extract method:

    df[['x','y','w','h']] = df.rect.str.extract(r'<Rect \((\d+),(\d+)\),(\d+) by (\d+)>')
    
    df
    #                          rect    x    y    w    h
    #0  <Rect (120,168),260 by 120>  120  168  260  120
    #1  <Rect (120,168),260 by 120>  120  168  260  120
    #2  <Rect (120,168),260 by 120>  120  168  260  120
    #3  <Rect (120,168),260 by 120>  120  168  260  120
    #4  <Rect (120,168),260 by 120>  120  168  260  120
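
Because this pattern spells out the whole format, it also gives a way to spot rows that do not follow it: str.extract leaves NaN in every column for a row that fails to match. A small sketch, with a deliberately malformed second row added as an assumption:

```python
import pandas as pd

# Hypothetical data: the second row does not follow the <Rect ...> format.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>", "not a rect"]})

pattern = r"<Rect \((\d+),(\d+)\),(\d+) by (\d+)>"
df[["x", "y", "w", "h"]] = df["rect"].str.extract(pattern)

# Rows where the pattern failed have NaN in every extracted column.
bad_rows = df[df["x"].isna()]
```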
    
  • 2020-12-21 06:15

    This is one of those cases where it makes sense to "optimize" the data itself instead of trying to morph it into what a consumer wants. It's much easier to change clean data into a specialized format than it is to change a specialized format into something portable.

    That said, if you really have to parse this, you can do something like

    >>> import re
    >>> re.findall(r'\d+', '<Rect (120,168),260 by 120>')
    ['120', '168', '260', '120']
    >>>
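
Applied to a whole column, the same idea might look like the sketch below. The sample frame and the names `number`, `rows` and `parsed` are illustrative; whether this beats the str methods on a large dataset is something to time on the real data rather than assume.

```python
import re
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"rect": ["<Rect (120,168),260 by 120>"] * 5})

# Compile once, then reuse the pattern for every row.
number = re.compile(r"\d+")
rows = [[int(n) for n in number.findall(s)] for s in df["rect"]]

# Reuse df's index so the parsed frame lines up with the original rows.
parsed = pd.DataFrame(rows, columns=["x", "y", "w", "h"], index=df.index)
```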
    