Python Regex[Forking] - Capture Groups Based on Terms but Skipping if another Term in the set is encountered

北战南征 submitted on 2019-12-23 03:34:15

Question


First of all, I'm forking off this question by @checkmate because the solutions posted there do not accurately produce what he gave as his "Expected Output." I'm not sure whether he wasn't paying attention or just posted it incorrectly, but solving this accurately would really help me in my personal projects: Get number present after a particular pattern of a matching string in Python

In his question he posts this expected output:

{'Ref.': 'UV1234'}
{'Expedien N°': '18-0022995'}
{'Expedien N°': '18-0022995'}
{'Expedien': '1-21-212-16-26'}
{'Reference' : 'RE9833'}

Please note that "tramite" is explicitly ignored in his "Expected Output." Note too that the line "{'Ref.': 'UV1234'}" in his expected output is incorrect, because 'UV1234' never appears in the string. I think he meant "{'Ref.': '1234567'}". And yes, I've tried reaching both of them via chat, but no luck.

In response, I came up with an ultra-specific solution that skips "tramite", but even a mild variation in the input breaks the regex. Additionally, because a line with "Ref.:" is followed by "Expedien N° [Numbers]", edits to the regex result in "Ref." being captured together with "[Numbers]" while "Expedien N°" is ignored, instead of "Expedien N°" being captured with "[Numbers]" (an example of this flawed variant follows below). I prefer to use "re.findall", but I am well aware that it does not recursively loop through the string. If what I describe below is only possible with "re.search", I still need to figure out how to solve it that way as well.

>>> import re

>>> string = '''some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.: 
tramite  1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content'''

>>> re.findall(r'(?:(Expedien[\s]+N\S|Ref\.(?!:[\S\s]{,11}Expedien)|Reference|Expedien))[\S\s]*?([A-Z\-]*(?:[\d]+)[\S]*)', string)

[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('Ref.', '1234567'), ('Expedien N\xb0', '18-00777'), ('Expedien N\xb0', '18-0022995')]

The Flaws:

- To capture correctly, it relies in part on "Ref\.(?!:[\S\s]{,11}Expedien)"

- First, that "11" has to be adjusted by hand for other lengths of text that may sit between "Ref.:" and the next term, and I can't figure out how to generalize it, so right now it is not flexible

- Second, if what needs to be captured comes after "Reference" or another term from my list, as opposed to "Expedien" (again, it is too specific), then the third "Ref." will be captured incorrectly
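Both flaws can be reproduced on reduced inputs. The sketch below reuses the pattern from above unchanged (only written as a raw string) and assumes Python 3; the three strings `short_gap`, `long_gap` and `other_term` are made-up fragments for illustration, not part of the original sample.

```python
import re

# The pattern from the attempt above, split over two raw-string literals.
pattern = re.compile(
    r'(?:(Expedien[\s]+N\S|Ref\.(?!:[\S\s]{,11}Expedien)|Reference|Expedien))'
    r'[\S\s]*?([A-Z\-]*(?:[\d]+)[\S]*)'
)

# Hypothetical fragments, invented for this illustration.
# 11 characters sit between ':' and 'Expedien', so the lookahead blocks 'Ref.'.
short_gap = "Ref.:\nsome junk Expedien N° 18-00777"
# 16 characters sit between ':' and 'Expedien', so the lookahead no longer fires.
long_gap = "Ref.:\nsome more junk Expedien N° 18-00777"
# The next term is 'Reference', which the lookahead does not know about.
other_term = "Ref.:\nsome junk Reference RE9833"

print(pattern.findall(short_gap))   # [('Expedien N°', '18-00777')]
print(pattern.findall(long_gap))    # [('Ref.', '18-00777')]
print(pattern.findall(other_term))  # [('Ref.', 'RE9833')]
```

Only the first result is the wanted one; the other two attach the value to the stray "Ref." because the hard-coded window and the hard-coded "Expedien" do not cover them.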

And in this slight variant, where I don't specify the range of 11 and drop the negative lookahead after "Ref.", "Ref." gets captured along with the numbers, and "Expedien N°", which should have been captured instead of "Ref.", is ignored:

>>> re.findall(r'(?:(Expedien[\s]+N\S|Ref\.|Reference|Expedien))[\S\s]*?([A-Z\-]*(?:[\d]+)[\S]*)', string)

[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('Ref.', '1234567'), ('Ref.', '18-00777'), ('Expedien N\xb0', '18-0022995')]

So, I was wondering:

How can I make the regex skip a term from my list when another term from that list occurs between it and the value to be captured?

The desired output follows below, but I want to know how to get it more reliably, because what I have above is ultra-specific:

[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('Ref.', '1234567'), ('Expedien N\xb0', '18-00777'), ('Expedien N\xb0', '18-0022995')]


Answer 1:


It's a bit long, but this regex with a negative lookahead should work for you:

(Ref\.:|Reference|Expediente|Expediente No|Expedien N°|Exp\.No|Expedien)\s*(?:(?!Ref\.:|Reference|Expediente|Expediente No|Expedien N°|Exp\.No|Expedien).)*?([A-Z]*\d+(?:-[A-Z]*\d+)*)

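(?!...) is a negative lookahead. It is checked before every character consumed between a tag and its value, which makes sure the match never runs across the start of another tag.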
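As a rough sketch of how this could be used from Python 3, the pattern can be assembled from a list of terms and run against the sample string from the question. The helper name `find_tagged_values`, the longest-first sort and the `re.DOTALL` flag are assumptions of this sketch, not part of the answer. The sort is there because alternation is tried left to right, so in the regex as written "Expediente" would win over "Expediente No"; `re.DOTALL` lets the gap between a term and its value span line breaks.

```python
import re

# The terms from the regex above, as plain strings.
TERMS = ["Ref.:", "Reference", "Expediente", "Expediente No",
         "Expedien N°", "Exp.No", "Expedien"]

def find_tagged_values(text, terms=TERMS):
    """Return (term, value) pairs, skipping a term when another term
    shows up before the next value. Hypothetical helper for illustration."""
    # Longest first, so 'Expedien N°' is tried before 'Expedien'.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(
        rf"({alternation})\s*"           # group 1: the term
        rf"(?:(?!{alternation}).)*?"     # gap that may not cross the start of a term
        r"([A-Z]*\d+(?:-[A-Z]*\d+)*)",   # group 2: the value, as in the regex above
        re.DOTALL,
    )
    return pattern.findall(text)

sample = '''some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.:
tramite  1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content'''

result = find_tagged_values(sample)
print(result)
# [('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('Ref.:', '1234567'),
#  ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]
```

Note that the first group comes back as 'Ref.:' with the colon, because that is how the term is spelled in the alternation; listing it as "Ref." instead would match the desired output in the question exactly.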



Source: https://stackoverflow.com/questions/55109460/python-regexforking-capture-groups-based-on-terms-but-skipping-if-another-te
