How can I remove all non-numeric characters from all the values in a particular column in a pandas DataFrame?

清酒与你 2020-11-30 05:43

I have a dataframe which looks like this:

     A       B           C
1   red78   square    big235
2   green   circle    small123
3   blue45  triangle  big657         


        
5 Answers
  • 2020-11-30 06:20

    Use str.extract and pass a regex pattern to extract just the numeric parts:

    In[40]:
    dfObject['C'] = dfObject['C'].str.extract(r'(\d+)', expand=False)
    dfObject
    
    Out[40]: 
            A         B    C
    1   red78    square  235
    2   green    circle  123
    3  blue45  triangle  657
    

    If needed you can cast to int:

    dfObject['C'] = dfObject['C'].astype(int)
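
One caveat with the cast: str.extract returns NaN for a row that contains no digits, and astype(int) then fails on that row. The following is a minimal sketch, assuming a hypothetical Series `s` whose last value has no digits, that goes through pd.to_numeric and the nullable Int64 dtype instead:

```python
import pandas as pd

# Hypothetical data: the last value contains no digits at all.
s = pd.Series(['big235', 'small123', 'none'])

# Rows without a match come back as NaN rather than as an empty string.
digits = s.str.extract(r'(\d+)', expand=False)

# to_numeric parses the digit strings; 'Int64' (capital I) is pandas'
# nullable integer dtype, so the missing row stays missing instead of raising.
numbers = pd.to_numeric(digits).astype('Int64')
print(numbers.tolist())
```

Whether a missing value is acceptable or should be filled (for example with 0) depends on the data, so this is only one possible way to handle it.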
    
  • 2020-11-30 06:23

    You can also do this via a lambda function with str.isdigit:

    import pandas as pd
    
    df = pd.DataFrame({'Name': ['John5', 'Tom 8', 'Ron 722']})
    
    df['Name'] = df['Name'].map(lambda x: ''.join([i for i in x if i.isdigit()]))
    
    #   Name
    # 0    5
    # 1    8
    # 2  722
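
Note that str.isdigit also returns True for some non-ASCII characters, such as the superscript '²'. If only 0-9 should survive, a membership test against the ASCII digits is one option. A small sketch, where the sample data and the helper name `ascii_digits_only` are made up for illustration:

```python
import pandas as pd

# Hypothetical data: '²' (superscript two) counts as a digit for str.isdigit.
df = pd.DataFrame({'Name': ['John5', 'Tom 8²', 'Ron 722']})

def ascii_digits_only(text):
    # Keep a character only if it is one of the ten ASCII digits 0-9.
    return ''.join(ch for ch in text if ch in '0123456789')

with_isdigit = df['Name'].map(lambda x: ''.join(ch for ch in x if ch.isdigit()))
ascii_only = df['Name'].map(ascii_digits_only)

print(with_isdigit.tolist())  # ['5', '8²', '722']
print(ascii_only.tolist())    # ['5', '8', '722']
```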
    
  • 2020-11-30 06:30

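    To remove all non-digit characters from the strings in a pandas column, use str.replace with the \D+ or [^0-9]+ pattern. Pass regex=True, because since pandas 2.0 the pattern is treated as a literal string by default:

    dfObject['C'] = dfObject['C'].str.replace(r'\D+', '', regex=True)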
    

    Alternatively, since in Python 3 \D is Unicode-aware by default and therefore does not match non-ASCII digits (such as ۱۲۳۴۵۶۷۸۹), consider the explicit ASCII class when only 0-9 should be kept:

    dfObject['C'] = dfObject['C'].str.replace(r'[^0-9]+', '', regex=True)
    

    For example:

    import re
    print(re.sub(r'\D+', '', '1۱۲۳۴۵۶۷۸۹0'))      # => 1۱۲۳۴۵۶۷۸۹0
    print(re.sub(r'[^0-9]+', '', '1۱۲۳۴۵۶۷۸۹0'))  # => 10
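
Another way to get ASCII-only behaviour while keeping \D is the re.ASCII flag, which restricts \d to [0-9]. The sketch below is an illustration rather than part of the original answer; the Series `s` is hypothetical, and a compiled pattern is passed to str.replace with regex=True:

```python
import re
import pandas as pd

# With re.ASCII, \d means [0-9] only, so \D also matches non-ASCII digits.
non_ascii_digit = re.compile(r'\D+', re.ASCII)

# '1' + Extended Arabic-Indic digits one to three + '0'
sample = '1\u06f1\u06f2\u06f30'
print(non_ascii_digit.sub('', sample))  # => 10

# Hypothetical Series; a compiled pattern requires regex=True in str.replace.
s = pd.Series(['big235', 'small\u06f1\u06f2\u06f3', 'big657'])
cleaned = s.str.replace(non_ascii_digit, '', regex=True)
print(cleaned.tolist())  # ['235', '', '657']
```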
    
  • 2020-11-30 06:39

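    You can use .str.replace with a regex. Note that this pattern removes only ASCII letters, so spaces, punctuation, and other non-digit characters are left in place:

    dfObject['C'] = dfObject['C'].str.replace(r"[a-zA-Z]", '', regex=True)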
    

    output:

            A         B    C
    1   red78    square  235
    2   green    circle  123
    3  blue45  triangle  657
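
To see where a letter-only pattern and a non-digit pattern diverge, here is a short comparison on hypothetical values that contain a space, a hyphen, and a dot (these values are not from the question):

```python
import pandas as pd

# Hypothetical values containing a space, a hyphen, and a dot.
s = pd.Series(['big 235', 'small-123', 'v1.5'])

letters_removed = s.str.replace(r'[a-zA-Z]', '', regex=True)
non_digits_removed = s.str.replace(r'[^0-9]+', '', regex=True)

print(letters_removed.tolist())     # [' 235', '-123', '1.5']
print(non_digits_removed.tolist())  # ['235', '123', '15']
```

For the data in the question both patterns give the same result, because column C contains only letters and digits.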
    
  • 2020-11-30 06:39

    Two years on, and to help others: I think you were very close to the answer. I have used your logic and made it work. Basically, you create a function that does the cleanup and then apply it to column C.

    import pandas as pd
    import re
    
    df = pd.DataFrame({
        'A': ['red78', 'green', 'blue45'],
        'B': ['square', 'circle', 'triangle'],
        'C': ['big235', 'small123', 'big657']
    })
    
    def remove_chars(s):
        # Drop every run of characters that are not ASCII digits.
        return re.sub(r'[^0-9]+', '', s)
    
    df['C'] = df['C'].apply(remove_chars)
    df
    

    Result below:

            A         B    C
    0   red78    square  235
    1   green    circle  123
    2  blue45  triangle  657
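
Because re.sub only accepts strings, a function applied this way raises a TypeError as soon as the column contains a missing value. A minimal sketch of a guarded variant, assuming a hypothetical column `col` with one missing entry (the name `remove_chars_safe` is also made up here):

```python
import re
import pandas as pd

# Hypothetical column with one missing value.
col = pd.Series(['big235', None, 'big657'])

def remove_chars_safe(value):
    # Only strings go through re.sub; anything else (None, NaN) is returned as is.
    if isinstance(value, str):
        return re.sub(r'[^0-9]+', '', value)
    return value

cleaned = col.apply(remove_chars_safe)
print(cleaned.tolist())
```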
    