Hi, I have a dataset as shown below:
Format,Message,time
A,ab@1 yl@5 rd@20 pp@40,3
B,bc@1 gn@7 yl@20 ss@25 rd@50, 21
C,cc@1 yl@9 rd@20, 22
I wou
Your question is really a number of questions. From the 'dataframe' tag, it appears you're doing this using pandas. The regular expression you're asking about could extract the numbers for 'yl' and 'rd' (if any; I'm assuming they are always there). But a regular expression typically doesn't do math or numerical comparisons, so that's a third, separate part.
A regular expression to match the numerical value for 'yl' (assuming integer, not float):
r'yl@(\d+)'
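As a quick check, the pattern can be tried with re.search on one of the sample messages. This is only a minimal sketch; the helper name extract_yl is made up for illustration and is not part of the question.

```python
import re

# Hypothetical helper (illustrative name): return the integer after 'yl@', or None.
def extract_yl(message):
    match = re.search(r'yl@(\d+)', message)
    return int(match.group(1)) if match else None

print(extract_yl('ab@1 yl@5 rd@20 pp@40'))  # 5
print(extract_yl('ab@1 rd@20'))             # None
```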
You could extract them in a single expression, but that would assume they are always in the same order, or become a complicated regular expression.
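If the order of the tags isn't guaranteed, one alternative to a single combined expression is to collect every tag@number pair and look the tags up by name afterwards. A sketch, assuming every tag is made of lowercase letters and every value is an integer; parse_pairs is an illustrative name:

```python
import re

# Hypothetical helper: map each tag in the message to its integer value.
def parse_pairs(message):
    # re.findall returns a list of (tag, digits) tuples for a two-group pattern.
    return {tag: int(value) for tag, value in re.findall(r'([a-z]+)@(\d+)', message)}

values = parse_pairs('bc@1 gn@7 yl@20 ss@25 rd@50')
print(values['yl'], values['rd'])  # 20 50
```

This sidesteps the ordering problem, at the cost of parsing tags you may not need.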
To ensure only yl@5 gets matched, but something like xyl@5 does not, you can add some restrictions to the start (require space or start of line) and end (require space or end of line):
r'(^|\s)yl@(\d+)($|\s)'
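A short sketch of how the anchored pattern behaves. Note that the extra groups shift the number to group 2; the sample strings are made up for illustration.

```python
import re

pattern = re.compile(r'(^|\s)yl@(\d+)($|\s)')

# Group 1 is the leading boundary, so the number is in group 2.
print(pattern.search('ab@1 yl@5 rd@20').group(2))  # 5
print(pattern.search('yl@5').group(2))             # 5
# 'xyl@5' is rejected because 'yl' is preceded by 'x', not a space or line start.
print(pattern.search('ab@1 xyl@5 rd@20'))          # None
```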
Or, if you have situations where yl is name-spaced, like a:yl, you can add that as well:
r'(^|\s)([a-z]+:)?yl@(\d+)($|\s)'
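A sketch of the name-spaced variant, assuming the pattern is meant to match the full yl tag after an optional lowercase prefix and colon. The number ends up in group 3 here; the sample messages are invented for illustration.

```python
import re

pattern = re.compile(r'(^|\s)([a-z]+:)?yl@(\d+)($|\s)')

results = {}
for message in ['ab@1 a:yl@5 rd@20', 'ab@1 yl@5 rd@20', 'ab@1 xyl@5 rd@20']:
    match = pattern.search(message)
    # Group 2 is the optional 'prefix:' part, group 3 is the number.
    results[message] = match.group(3) if match else None
    print(message, '->', results[message])
```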
However, all this is just building more specific expressions using the regular expression language. A very good tool for writing regex I enjoy using (no affiliation) is RegexBuddy, but there are pretty good online tools as well, like https://regex101.com/.
Used in a code example basically doing what you suggested:
import re
from pandas import DataFrame

df = DataFrame({
    'Format': ['A', 'B', 'C'],
    'Message': ['ab@1 yl@5 rd@20 pp@40', 'bc@1 gn@7 yl@20 ss@25 rd@50', 'cc@1 yl@9 rd@20'],
    'time': [3, 21, 22]
})


def determine_status(row):
    def find(tag, message):
        match = re.search(rf"{tag}@(\d+)", message)
        if match:
            return match.group(1)
        else:
            raise ValueError(f'{tag} not in message.')

    yl = int(find('yl', row['Message']))
    rd = int(find('rd', row['Message']))
    time = int(row['time'])
    if time < yl < rd:
        return 'g'
    if yl <= time < rd:
        return 'y'
    return 'r'


df['status'] = df.apply(determine_status, axis=1)
print(df)
The find function takes a tag and a message and uses a regular expression to return the digits that follow the tag in the message, as a string; the caller converts them with int. If the tag is missing, it raises a ValueError.
The determine_status function does just that: it expects a row from a DataFrame, uses the Message and time columns to determine a status, and returns it.
df.apply is then used to create a new status column and fill it with the result of determine_status for every row in the DataFrame.
You didn't specify what version of Python you are using, but if it's a version before Python 3.6, you'll find that expressions like f'{tag} not in message.' won't work. Instead you'd use something like '{tag} not in message.'.format(tag=tag).
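For illustration, the two spellings produce the same string. The variable names below are made up for this sketch.

```python
tag = 'yl'

# Python 3.6 and later: f-string.
new_style = f'{tag} not in message.'

# Also works on older versions: str.format with a named field.
old_style = '{tag} not in message.'.format(tag=tag)

print(new_style == old_style)  # True
```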
I think this can be done with built-in string functions. Try this!
def f(mess):
    p1 = mess.find('yl')
    p2 = mess.find('rd')
    return int(mess[p1+3:].split(' ')[0]), int(mess[p2+3:].split(' ')[0])

df['vals'] = df['Message'].apply(f)
df['status'] = df.apply(lambda row: 'g' if min(row['vals']) > row.time \
                        else 'y' if row.vals[1] > row.time \
                        else 'r', axis=1)
print(df)
output:
  Format                 Message  time      vals status
0      A   ab@1 yl@5 rd@20 pp@40     3   (5, 20)      g
1      B  bc@1 yl@20 ss@25 rd@50    21  (20, 50)      y
2      C         cc@1 yl@9 rd@20    22   (9, 20)      r
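One thing to keep in mind with str.find is that it returns -1 when the tag is missing and also matches inside longer tags such as xyl. A possible variation that still uses only built-in string methods is sketched below; it assumes every space-separated token has the form tag@integer, and parse_message is an illustrative name, not part of the answer above.

```python
# Hypothetical helper: split the message into tokens, then each token at '@'.
def parse_message(message):
    values = {}
    for token in message.split():
        tag, _, number = token.partition('@')
        values[tag] = int(number)  # assumes the part after '@' is an integer
    return values

values = parse_message('cc@1 yl@9 rd@20')
print(values['yl'], values['rd'])            # 9 20
print('yl' in parse_message('ab@1 xyl@5'))   # False
```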