Removing a substring from a list of strings

长发绾君心 2021-01-29 07:02

Several countries in my list have numbers and/or parentheses in their names. How do I remove these?

e.g.

'Bolivia (Plurinational State of)' should be 'Bolivia'

4 answers
  • 2021-01-29 07:28

    You can remove the unwanted parts of each string with re.sub.

    Remove numbers:

    import re

    a = 'Switzerland17'
    pattern = r'[0-9]'  # any single digit
    res = re.sub(pattern, '', a)  # replace every digit with an empty string
    print(res)
    

    Output:

    Switzerland
    

    Remove parentheses and the text inside them:

    b = 'Bolivia (Plurinational State of)'
    # optional whitespace, then everything from the first ( to the last )
    pattern2 = r'\s*\(.*\)'
    res2 = re.sub(pattern2, '', b)
    print(res2)
    

    Output:

    Bolivia
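
Since the question asks about a list rather than a single string, the two substitutions above can be combined and applied to every item. This is only a minimal sketch: the helper name `clean_name` and the sample list are assumptions made for illustration, while the two patterns are the ones used in the answer above.

```python
import re

# The two patterns from the answer above, compiled once for reuse.
DIGITS = re.compile(r'[0-9]')
PARENS = re.compile(r'\s*\(.*\)')


def clean_name(name):
    """Strip digits, then any parenthesised part, from one name (hypothetical helper)."""
    return PARENS.sub('', DIGITS.sub('', name))


# Sample data assumed for illustration.
countries = ['Switzerland17', 'Bolivia (Plurinational State of)', 'United Kingdom']
cleaned = [clean_name(c) for c in countries]
print(cleaned)  # ['Switzerland', 'Bolivia', 'United Kingdom']
```

Building a new list with a comprehension leaves the original list untouched, which makes it easy to compare the values before and after cleaning.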
    
  • 2021-01-29 07:40

    Using regex and a simple list operation

    Go through the list items, match the regex against each item, and replace the value in place. The pattern "[a-zA-Z]{2,}" matches a run of two or more ASCII letters, and re.match anchors it at the start of the string, so each item is cut off at its first non-letter character, whether that is a digit, a space, or a parenthesis. This approach relies on the input domain (country names in your case): a country name does not contain a number or a parenthesis. Note that the match also stops at the first space, so a multi-word name such as "United Kingdom" would be reduced to "United". With that in mind, you can use the following.

    import re

    list_of_country_strings = ["Switzerland17", "America290", "Korea(S)"]
    for index, name in enumerate(list_of_country_strings):
        # re.match only matches at the start of the string
        x = re.match(r"[a-zA-Z]{2,}", name)
        if x:
            # keep just the leading run of letters
            list_of_country_strings[index] = x.group()

    print(list_of_country_strings)
    

    Output: ['Switzerland', 'America', 'Korea']
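
The pattern in the answer above stops at the first non-letter, so it keeps only "United" from "United Kingdom". One possible variation, shown here as a sketch rather than as part of the original answer, lets the match continue across single spaces as long as letters follow. The names `NAME` and `leading_name` and the sample list are assumptions for illustration. Names with accents, hyphens, or apostrophes would still be cut short by this pattern.

```python
import re

# Assumed variation: runs of ASCII letters separated by single spaces.
NAME = re.compile(r'[A-Za-z]+(?: [A-Za-z]+)*')


def leading_name(text):
    """Return the leading letters-and-spaces part of text, or text itself if nothing matches."""
    m = NAME.match(text)
    return m.group() if m else text


# Sample data assumed for illustration.
samples = ['Switzerland17', 'Korea(S)', 'United Kingdom',
           'Bolivia (Plurinational State of)']
result = [leading_name(s) for s in samples]
print(result)  # ['Switzerland', 'Korea', 'United Kingdom', 'Bolivia']
```

The space before "(Plurinational" is not consumed because the optional group requires a letter right after the space.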

  • 2021-01-29 07:47

    Use Series.str.replace with a regex for the replacement: \s* matches any spaces before the (, \(.*\) matches the parentheses and everything between them, | is the regex "or", and \d+ matches a run of one or more digits:

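    import pandas as pd

    df = pd.DataFrame({'a': ['Bolivia (Plurinational State of)', 'Switzerland17']})

    # regex=True is needed in pandas 2.0 and later, where str.replace matches literally by default
    df['a'] = df['a'].str.replace(r'\s*\(.*\)|\d+', '', regex=True)
    print(df)
                 a
    0      Bolivia
    1  Switzerland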
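
To make the pieces of that pattern easier to see, here is the same regex written with `re.VERBOSE`, which ignores whitespace inside the pattern and allows comments. This is a plain-Python sketch that does not need pandas; the names `PATTERN`, `names`, and `stripped` are assumptions for illustration.

```python
import re

# Same alternation as in the answer above, spread out with comments.
PATTERN = re.compile(r"""
    \s*         # optional whitespace before the opening parenthesis
    \( .* \)    # an opening parenthesis, any characters (greedy), a closing parenthesis
    |           # OR
    \d+         # a run of one or more digits
""", re.VERBOSE)

# Sample data taken from the DataFrame in the answer above.
names = ['Bolivia (Plurinational State of)', 'Switzerland17']
stripped = [PATTERN.sub('', n) for n in names]
print(stripped)  # ['Bolivia', 'Switzerland']
```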
    
  • 2021-01-29 07:48

    Just run:

    df['Country'] = df['Country'].replace(r'\d+|\s*\([^)]*\)', '', regex=True)
    

    Assuming that the initial content of your DataFrame is:

                                Country
    0  Bolivia (Plurinational State of)
    1                     Switzerland17
    2                    United Kingdom
    

    after the above replace you will have:

              Country
    0         Bolivia
    1     Switzerland
    2  United Kingdom
    

    The above pattern contains:

    • the first alternative: a non-empty sequence of digits,
    • the second alternative:
      • an optional sequence of whitespace characters,
      • an opening parenthesis (escaped),
      • a sequence of characters other than ) (inside square brackets no escaping is needed),
      • a closing parenthesis (also escaped).
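
The "characters other than )" part matters when a string contains more than one parenthesised group. The sketch below contrasts it with the greedy `.*` form used in the earlier answers. The input string is made up for illustration and the variable names are assumptions.

```python
import re

# Made-up input with two separate parenthesised groups.
text = 'Alpha (one) Beta (two)'

# Greedy: .* runs from the first ( to the LAST ), swallowing " Beta" as well.
greedy = re.sub(r'\s*\(.*\)', '', text)

# Bounded: [^)]* stops at the first ), so each group is removed on its own.
bounded = re.sub(r'\s*\([^)]*\)', '', text)

print(greedy)   # Alpha
print(bounded)  # Alpha Beta
```

For names with a single parenthesised suffix, such as the Bolivia example, both forms give the same result.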