There are several countries with numbers and/or parentheses in my list. How do I remove these?
e.g.
'Bolivia (Plurinational State of)' should be 'Bolivia'
You can strip the unwanted parts of the string this way:-
Remove numbers:-
import re
a = 'Switzerland17'
pattern = '[0-9]'
res = re.sub(pattern, '', a)
print(res)
Output:-
'Switzerland'
Remove parentheses:-
b = 'Bolivia (Plurinational State of)'
# raw string: optional whitespace, then everything from "(" to the last ")"
pattern2 = r'\s*\(.*\)'
res2 = re.sub(pattern2, '', b)
print(res2)
Output:-
'Bolivia'
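The two substitutions above can be combined into one pattern and applied to a whole list. This is only a minimal sketch: the names CLEANUP and clean_name are made up for illustration, and it assumes each entry contains at most one parenthesized group.

```python
import re

# Hypothetical helper names; the pattern joins the two regexes shown above:
# a run of digits, OR optional whitespace followed by "(...)".
CLEANUP = re.compile(r'[0-9]+|\s*\(.*\)')

def clean_name(name):
    """Strip digits and parenthesized text from one country name."""
    return CLEANUP.sub('', name).strip()

countries = ['Bolivia (Plurinational State of)', 'Switzerland17', 'United Kingdom']
cleaned = [clean_name(c) for c in countries]
print(cleaned)  # ['Bolivia', 'Switzerland', 'United Kingdom']
```

Compiling the pattern once avoids re-parsing it for every list item, which matters little for a short list but keeps the intent in one place.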
Using regex and a simple list operation
Go through the list items, match each item against a regex, and replace the value in place. The regex "[a-zA-Z]{2,}" matches a run of two or more ASCII letters, and re.match anchors it at the start of the string, so the match ends at the first digit, space, or parenthesis. This approach relies on your input domain (country names in your case): a country name should not contain a number or a parenthesis, so the leading run of letters is the name. Be aware that the match also ends at the first space, so a multi-word name such as "United Kingdom" would be cut to "United". For single-word names you can use the following.
import re
list_of_country_strings = ["Switzerland17", "America290", "Korea(S)"]
for index in range(len(list_of_country_strings)):
    # re.match only matches at the start of the string
    x = re.match("[a-zA-Z]{2,}", string=list_of_country_strings[index])
    if x:
        # keep only the matched leading letters
        list_of_country_strings[index] = list_of_country_strings[index][x.start():x.end()]
print(list_of_country_strings)
Output ['Switzerland', 'America', 'Korea']
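Because re.match with "[a-zA-Z]{2,}" stops at the first space, multi-word names get truncated. One possible variation, sketched below, allows further words separated by single spaces. The names NAME_PREFIX and leading_name are hypothetical, and the pattern still assumes plain ASCII letters, so names with accents, hyphens, or apostrophes would need a wider character class.

```python
import re

# Hypothetical names. One or more ASCII-letter words separated by single spaces.
NAME_PREFIX = re.compile(r'[A-Za-z]+(?: [A-Za-z]+)*')

def leading_name(text):
    """Return the leading run of letter-only words, or the text unchanged if none."""
    m = NAME_PREFIX.match(text)
    return m.group(0) if m else text

samples = ['Switzerland17', 'Korea(S)', 'United Kingdom',
           'Bolivia (Plurinational State of)']
result = [leading_name(s) for s in samples]
print(result)  # ['Switzerland', 'Korea', 'United Kingdom', 'Bolivia']
```

The space before "(" in the Bolivia entry is not consumed, because the optional group requires at least one letter after each space.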
Use Series.str.replace with a regex for the replacement: \s* is for possible spaces before (, then \(.*\) is for the parentheses and the values between them, | is for regex or, and \d+ is for numbers with 1 or more digits:
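import pandas as pd
df = pd.DataFrame({'a': ['Bolivia (Plurinational State of)', 'Switzerland17']})
# regex=True is required in pandas 2.0+, where str.replace defaults to a literal match
df['a'] = df['a'].str.replace(r'\s*\(.*\)|\d+', '', regex=True)
print(df)
             a
0      Bolivia
1  Switzerland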
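Just run:
# assign the result back; inplace=True on df.Country may not update df when copy-on-write is enabled
df['Country'] = df['Country'].replace(r'\d+|\s*\([^)]*\)', '', regex=True)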
Assuming that the initial content of your DataFrame is:
                            Country
0  Bolivia (Plurinational State of)
1                     Switzerland17
2                    United Kingdom
after the above replace you will have:
          Country
0         Bolivia
1     Switzerland
2  United Kingdom
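The above pattern contains:
- \d+ : the first alternative, a non-empty sequence of digits,
- | : the regex "or" separating the two alternatives,
- \s* : an optional sequence of whitespace characters,
- \( : an opening parenthesis (quoted with a backslash),
- [^)]* : a sequence of characters other than ) (between brackets no quotation is needed),
- \) : a closing parenthesis (also quoted).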
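The difference between [^)]* in this pattern and the greedy .* used in the earlier answers only shows up when a string has more than one parenthesized group. The sample string below is invented for illustration:

```python
import re

# Made-up input with two parenthesized groups.
text = 'Congo (Dem. Rep.) and Congo (Rep.)'

# Greedy .* runs from the first "(" to the LAST ")", removing everything between.
greedy = re.sub(r'\s*\(.*\)', '', text)

# [^)]* cannot cross a ")", so each group is removed separately.
bounded = re.sub(r'\s*\([^)]*\)', '', text)

print(greedy)   # Congo
print(bounded)  # Congo and Congo
```

For entries with a single trailing group, such as 'Bolivia (Plurinational State of)', both patterns give the same result.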