I have a dataframe which looks like this:
A B C
1 red78 square big235
2 green circle small123
3 blue45 triangle big657
Use str.extract
and pass a regex pattern to extract just the numeric parts:
In[40]:
dfObject['C'] = dfObject['C'].str.extract('(\d+)', expand=False)
dfObject
Out[40]:
A B C
1 red78 square 235
2 green circle 123
3 blue45 triangle 657
If needed you can cast to int
:
dfObject['C'] = dfObject['C'].astype(int)
You can also do this via a lambda
function with str.isdigit
:
import pandas as pd
df = pd.DataFrame({'Name': ['John5', 'Tom 8', 'Ron 722']})
df['Name'] = df['Name'].map(lambda x: ''.join([i for i in x if i.isdigit()]))
# Name
# 0 5
# 1 8
# 2 722
To remove all non-digit characters from strings in a Pandas column you should use str.replace with \D+
or [^0-9]+
patterns:
dfObject['C'] = dfObject['C'].str.replace(r'\D+', '')
Or, since in Python 3, \D
is fully Unicode-aware by default and thus does not match non-ASCII digits (like ۱۲۳۴۵۶۷۸۹
, see proof) you should consider
dfObject['C'] = dfObject['C'].str.replace(r'[^0-9]+', '')
So,
import re
print ( re.sub( r'\D+', '', '1۱۲۳۴۵۶۷۸۹0') ) # => 1۱۲۳۴۵۶۷۸۹0
print ( re.sub( r'[^0-9]+', '', '1۱۲۳۴۵۶۷۸۹0') ) # => 10
You can use .str.replace
with a regex:
dfObject['C'] = dfObject.C.str.replace(r"[a-zA-Z]",'')
output:
A B C
1 red78 square 235
2 green circle 123
3 blue45 triangle 657
After 2 years, to help others, I actually think that you were very close to the answer. I have used your logic but made it work. basically you create a function that does the clean up and then apply it to the column C
.
import pandas as pd
import re
df = pd.DataFrame({
'A': ['red78', 'green', 'blue45'],
'B': ['square', 'circle', 'triangle'],
'C': ['big235', 'small123', 'big657']
})
def remove_chars(s):
return re.sub('[^0-9]+', '', s)
df['C'] = df['C'].apply(remove_chars)
df
Result below:
A B C
0 red78 square 235
1 green circle 123
2 blue45 triangle 657