Extract integers with specific length between separators

会有一股神秘感。 Submitted on 2019-12-10 16:37:51


Given a list of strings like:

L = ['1759@1@83@0#1362@0.2600@25.7400@2.8600#1094@1@129.6@14.4', 

I need to extract all integers with length 4 between separators # or @, and also extract the first and last integers. No floats.

My solution is a bit overcomplicated: I replace the separators with spaces and then apply this pattern:

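import re

# After replacing # and @ with spaces, match 4-digit runs delimited by whitespace or string boundaries
pat = r'(?<!\S)\d{4}(?!\S)'
out = [re.findall(pat, re.sub('[#@]', ' ', x)) for x in L]
print(out)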
[['1759', '1362', '1094'], 
 ['1354', '1101', '1108'], 
 ['1430', '1431', '1074', '1109'], 
 ['1809', '1816', '1076']]

Is it possible to change the regex so that re.sub is no longer needed for the replacement? Is there another solution with better performance?


To allow the first and last occurrences, which have no leading or trailing separator, you could use negative lookarounds.


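(?<![^#]) is a near synonym for (?:^|#): it succeeds at the start of the string or right after a #. The same applies to the negative lookahead, with the end of the string in place of the start.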
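
A minimal sketch of this lookaround idea in Python, assuming both # and @ count as separators. The pattern below is an illustrative assumption built from the explanation above, not necessarily the exact pattern the answer used; the sample string is the first element of L from the question.

```python
import re

# Assumed pattern: 4 digits that are neither preceded nor followed by
# anything other than a separator (# or @) or a string boundary.
pat = re.compile(r'(?<![^#@])\d{4}(?![^#@])')

sample = '1759@1@83@0#1362@0.2600@25.7400@2.8600#1094@1@129.6@14.4'
result = pat.findall(sample)
print(result)  # ['1759', '1362', '1094']
```

Because both lookarounds also succeed at the string boundaries, no re.sub preprocessing is needed. Floats such as 0.2600 are rejected because their digits touch a period, and longer runs such as 12345 are rejected because a digit is not a separator.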



Interesting problem!

This can be easily tackled with the concepts of lookahead & lookbehind.


import re

# Raw string so that \. and \d reach the regex engine unchanged
pattern = r"(?<!\.)(?<=[#@])\d{4}|(?<!\.)\d{4}(?=[@#])"
out = [re.findall(pattern, x) for x in L]
print(out)


[['1759', '1362', '1094', '1234'],
 ['1354', '1101', '1108'],
 ['1430', '1431', '1074', '1109'],
 ['1809', '1816', '1076', '1110']]


The above pattern combines two separate patterns joined by | (the OR operator).

pattern_1 = r"(?<!\.)(?<=[#@])\d{4}"
\d{4}     --- Match 4 consecutive digits
(?<!\.)   --- The 4 digits must not be preceded by a period (.) NEGATIVE LOOKBEHIND
(?<=[#@]) --- The 4 digits must be preceded by a hash (#) or an at sign (@) POSITIVE LOOKBEHIND

pattern_2 = r"(?<!\.)\d{4}(?=[@#])"
\d{4}     --- Match 4 consecutive digits
(?<!\.)   --- The 4 digits must not be preceded by a period (.) NEGATIVE LOOKBEHIND
(?=[@#])  --- The 4 digits must be followed by a hash (#) or an at sign (@) POSITIVE LOOKAHEAD
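
To see what each branch contributes, here is a small sketch that runs the two sub-patterns separately and then combined. The sample string is the first element of L from the question; the variable names after_sep, before_sep and combined are only illustrative.

```python
import re

sample = '1759@1@83@0#1362@0.2600@25.7400@2.8600#1094@1@129.6@14.4'

pattern_1 = r"(?<!\.)(?<=[#@])\d{4}"  # 4 digits right after # or @
pattern_2 = r"(?<!\.)\d{4}(?=[@#])"   # 4 digits right before @ or #

after_sep = re.findall(pattern_1, sample)
before_sep = re.findall(pattern_2, sample)
combined = re.findall(pattern_1 + "|" + pattern_2, sample)

print(after_sep)   # ['1362', '1094']
print(before_sep)  # ['1759', '1362', '1094']
print(combined)    # ['1759', '1362', '1094']
```

The first integer, 1759, has no separator in front of it, so only the second branch can pick it up. Note that neither branch checks the other side of the match, so a longer run such as 12345 placed right after a separator would still yield its first four digits.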



Here is a list comprehension that avoids regex. It also picks up integers of length 4 that have no leading # or trailing @:

[[m for k in j.split("#") for m in k.split("@")
  if m.isdigit() and str(int(m)) == m and len(m) == 4]
 for j in L]

Output:

[['1759', '1362', '1094'], ['1356'], ['1354', '1101', '1108'], ['1430', '1431', '1074', '1109'], ['1809', '1816', '1076']]
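
For readability, the same filter can be unrolled into explicit loops. This is only a sketch: the function name four_digit_ints is hypothetical, plain ASCII input is assumed, and the sample string is the first element of L from the question.

```python
def four_digit_ints(s):
    """Return the 4-character integer tokens of s, split on # and @."""
    found = []
    for group in s.split("#"):
        for token in group.split("@"):
            # isdigit() rejects floats and signs; the str(int(...)) round
            # trip rejects tokens with leading zeros such as '0123'.
            if len(token) == 4 and token.isdigit() and str(int(token)) == token:
                found.append(token)
    return found


sample = '1759@1@83@0#1362@0.2600@25.7400@2.8600#1094@1@129.6@14.4'
print(four_digit_ints(sample))  # ['1759', '1362', '1094']
```

Applied to every element, [four_digit_ints(x) for x in L] would produce one list per input string, like the comprehension above.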

