Regex to compare and extract alphabet characters using Python

前端未结

Hi, I have a dataset as shown below:

Format,Message,time
A,ab@1 yl@5 rd@20 pp@40,3
B,bc@1 gn@7 yl@20 ss@25 rd@50, 21
C,cc@1 yl@9 rd@20, 22

I wou

2 answers

刺人心

2021-01-22 08:40
Your question is really several questions. From the 'dataframe' tag, it appears you're doing this with pandas. The regular expression you're asking about can extract the numbers for 'yl' and 'rd' (I'm assuming they are always present). But a regular expression doesn't do math or numerical comparisons, so that's a third part.

A regular expression to match the numerical value for 'yl' (assuming integer, not float):
```
r'yl@(\d+)'
```
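As a quick check, the pattern can be tried on one of the sample messages with `re.search`. The variable names here are only illustrative:

```python
import re

message = 'ab@1 yl@5 rd@20 pp@40'

# Capture the digits that follow 'yl@'.
match = re.search(r'yl@(\d+)', message)
yl_value = int(match.group(1)) if match else None
print(yl_value)  # 5
```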
You could extract them in a single expression, but that would assume they are always in the same order, or become a complicated regular expression.
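One way around the ordering problem, which is a different technique from the single-tag pattern above, is to collect every tag@number pair with `re.findall` and look the tags up afterwards. This is only a sketch: `parse_tags` is a hypothetical helper, and it assumes tags are lowercase letters and that each tag appears at most once per message.

```python
import re


def parse_tags(message):
    # Each tag@number pair becomes one dict entry, whatever its position.
    pairs = re.findall(r'([a-z]+)@(\d+)', message)
    return {tag: int(value) for tag, value in pairs}


tags = parse_tags('bc@1 gn@7 yl@20 ss@25 rd@50')
print(tags['yl'], tags['rd'])  # 20 50
```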

To ensure only yl@5 gets matched, but something like xyl@5 does not, you can add some restrictions to the start (require space or start of line) and end (require space or end of line):
```
r'(^|\s)yl@(\d+)($|\s)'
```
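A small check of the stricter pattern, using made-up messages. Because the boundary is itself a group, the number now sits in group 2 rather than group 1:

```python
import re

pattern = re.compile(r'(^|\s)yl@(\d+)($|\s)')

# Group 1 holds the leading boundary, so the number is in group 2.
print(pattern.search('ab@1 yl@5 rd@20').group(2))  # 5

# 'xyl@5' is rejected because 'yl' is not preceded by a space or line start.
print(pattern.search('ab@1 xyl@5 rd@20'))  # None
```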
Or, if you have situations where yl is name-spaced, like a:yl, you can add that as well:
```
r'(^|\s)([a-z]+:)?yl@(\d+)($|\s)'
```
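A short check of the name-spaced form, with the tag written out in full as `yl` and made-up messages. The optional prefix adds another group, so the number moves to group 3:

```python
import re

pattern = re.compile(r'(^|\s)([a-z]+:)?yl@(\d+)($|\s)')

with_prefix = pattern.search('ab@1 a:yl@5 rd@20')
print(with_prefix.group(2), with_prefix.group(3))  # a: 5

# Without a prefix the optional group is None, but the number is still found.
without_prefix = pattern.search('yl@7 rd@20')
print(without_prefix.group(2), without_prefix.group(3))  # None 7
```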
However, all this is just building more specific expressions using the regular expression language. A very good tool for writing regex I enjoy using (no affiliation) is RegexBuddy, but there are pretty good online tools as well, like https://regex101.com/.

Here it is used in a code example that does basically what you suggested:
```
import re
from pandas import DataFrame

df = DataFrame({
    'Format': ['A', 'B', 'C'],
    'Message': ['ab@1 yl@5 rd@20 pp@40', 'bc@1 gn@7 yl@20 ss@25 rd@50', 'cc@1 yl@9 rd@20'],
    'time': [3, 21, 22]
})


def determine_status(row):
    def find(tag, message):
        match = re.search(rf"{tag}@(\d+)", message)
        if match:
            return match.group(1)
        else:
            raise ValueError(f'{tag} not in message.')

    yl = int(find('yl', row['Message']))
    rd = int(find('rd', row['Message']))

    time = int(row['time'])
    if time < yl < rd:
        return 'g'
    if yl <= time < rd:
        return 'y'
    return 'r'


df['status'] = df.apply(determine_status, axis=1)

print(df)
```
The find function takes a tag and a message and produces the numerical value for the tag in the message using a regular expression.

The determine_status function does just that: it expects a row from a DataFrame and uses the Message and time columns to determine a status, which it returns.

df.apply is then used to create a new status column and fill it with the result of determine_status for every row in the DataFrame.
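The comparison inside determine_status can also be checked on its own, without pandas. In this sketch `classify` is a hypothetical stand-alone copy of that logic, fed the time, yl and rd values from the three sample rows:

```python
def classify(time, yl, rd):
    # Same three-way comparison as in determine_status.
    if time < yl < rd:
        return 'g'
    if yl <= time < rd:
        return 'y'
    return 'r'


# (time, yl, rd) taken from sample rows A, B and C
for values in [(3, 5, 20), (21, 20, 50), (22, 9, 20)]:
    print(values, classify(*values))
```

For the sample rows this gives 'g', 'y' and 'r' respectively.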

You didn't specify what version of Python you are using, but if it's a version before Python 3.6, you'll find that the expressions like f'{tag} not in message.' won't work - instead you'd use something like '{tag} not in message.'.format(tag=tag).
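For those older versions, a sketch of the same find helper written with str.format might look like this. The rf-string in the pattern is replaced too, since it relies on the same f-string feature:

```python
import re


def find(tag, message):
    # str.format fills in the tag; the only braces in the pattern
    # are the {tag} placeholder, so nothing needs escaping.
    pattern = r'{tag}@(\d+)'.format(tag=tag)
    match = re.search(pattern, message)
    if match:
        return match.group(1)
    raise ValueError('{tag} not in message.'.format(tag=tag))


print(find('yl', 'cc@1 yl@9 rd@20'))  # 9
```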

没有蜡笔的小新

2021-01-22 08:58

I think this can be done with built-in string functions. Try this:

```
def f(mess):
    p1 = mess.find('yl')
    p2 = mess.find('rd')
    return int(mess[p1+3:].split(' ')[0]), int(mess[p2+3:].split(' ')[0])

df['vals'] = df['Message'].apply(f)

df['status'] = df.apply(lambda row: 'g' if min(row['vals']) > row.time
                        else 'y' if row.vals[1] > row.time
                        else 'r', axis=1)

print(df)
```

Output:
```
  Format                  Message  time      vals status
0      A    ab@1 yl@5 rd@20 pp@40     3   (5, 20)      g
1      B  bc@1  yl@20 ss@25 rd@50    21  (20, 50)      y
2      C          cc@1 yl@9 rd@20    22   (9, 20)      r
```
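One caveat with str.find: it returns -1 when the tag is absent, and it also matches 'yl' inside a longer tag such as 'xyl'. A minimal sketch that avoids both by splitting the message into tokens first is shown below. `parse_message` is a hypothetical helper, and it assumes every token has the form tag@integer:

```python
def parse_message(message):
    # 'yl@5' -> ['yl', '5']; split() with no arguments tolerates repeated spaces.
    pairs = (item.split('@', 1) for item in message.split())
    return {tag: int(value) for tag, value in pairs}


values = parse_message('bc@1  yl@20 ss@25 rd@50')
print(values['yl'], values['rd'])  # 20 50
```

A missing tag then shows up as a KeyError on lookup instead of a silently wrong number.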
