Question
I use regex to match certain expressions within a text.
Assume I want to match a number, or several numbers separated by commas (with or without spaces), all within parentheses in a text. (In reality the matches are more complex, including spaces etc.)
I do the following:
import re
pattern = re.compile(r"(\()([0-9]+(,)?( )?)+(\))")
matches = pattern.findall(content)
matches is a list with the matches:
for i, match in enumerate(matches):
    print(i, match)
Example text:
Lorem ipsum dolor sit amet (12,16) , consectetur 23 adipiscing elit. Curabitur (45) euismod scelerisque consectetur. Vivamus aliquam velit (46,48,49) at augue faucibus, id eleifend purus egestas. Aliquam vitae mauris cursus, facilisis enim condimentum, vestibulum enim. Praesent
QUESTION 1: How do I get the list of FULL matches, like:
matches = ["(12,16)", "(45)", "(46,48,49)"]
QUESTION 2: How do I get a list with the n preceding words of each full match? I am trying to split the text into words. One problem is that a hit such as (12,16) might occur several times in the text. A second problem is that using:
mywordlist = text.split(' ')
might also split the match itself, in case I want to catch punctuation separately from the words, or in case there are spaces within the (). In the example, the words I want to get back are the 4 words before each match:
"ipsum dolor sit amet" (12,16)
"adipiscing elit. Curabitur" (45)
". Vivamus aliquam velit" (46,48,49)
AFTER SOME COMMENTS: print(matches) gives me:
matches = pattern.findall(content)
print('the matches are:')
print('type of variable matches',type(matches))
print(matches)
[('(', '16', ',', ')'), ('(', '45', '', ')'), ('(', '49', ',', ')')]
Answer 1:
Sample code with changed regex - test here: https://regex101.com/r/mV1l3E/3
import re
regex = r"(\w+ \w+ (?=\(\d))(\([\d,]+\))"
test_str = """bla kra tu (34) blaka trutra (33,45) afda
bla kra tu (34) blaka trutra (33,45) afdabla kra tu (34) blaka trutra (33,45) afda
bla kra tu (34) blaka trutra (33,45) afda"""
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)
for first_matching_group, number_group in matches:
    print(first_matching_group, "===>", number_group)
Output:
# matches (each a tuple of both groups)
[('kra tu ', '(34)'), ('blaka trutra ', '(33,45)'), ('kra tu ', '(34)'),
('blaka trutra ', '(33,45)'), ('kra tu ', '(34)'), ('blaka trutra ', '(33,45)'),
('kra tu ', '(34)'), ('blaka trutra ', '(33,45)')]
# for loop output (as printed by Python 2; Python 3 prints the three values without the tuple syntax)
('kra tu ', '===>', '(34)')
('blaka trutra ', '===>', '(33,45)')
('kra tu ', '===>', '(34)')
('blaka trutra ', '===>', '(33,45)')
('kra tu ', '===>', '(34)')
('blaka trutra ', '===>', '(33,45)')
('kra tu ', '===>', '(34)')
('blaka trutra ', '===>', '(33,45)')
Pattern explanation:
(\w+ \w+ (?=\(\d))(\([\d,]+\))
------------------============
There are two groups in the pattern. The first group (underlined with ------) looks for 2 words, each followed by a space, using multiple word characters (\w+ ), with a lookahead for an opening parenthesis and one digit (you may want to include the full second pattern in the lookahead to avoid mismatches). The second group (underlined with ======) looks for an opening parenthesis, then multiple digits and commas, followed by a closing parenthesis.
The regex101 link https://regex101.com/r/mV1l3E/3/ explains it much better, and in color, if you copy the pattern into its regex field.
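For readers who prefer to stay in Python, the same explanation can be written into the pattern itself with re.VERBOSE. This is only a sketch of the two-word form described above, not the pattern from the regex101 link: in verbose mode plain spaces are ignored, so each literal space is written as [ ].

```python
import re

# Assumption: the two-word form of the pattern, rewritten in verbose mode.
pattern = re.compile(r"""
    (               # group 1: the preceding words
      \w+[ ]\w+[ ]  #   two words, each followed by one space
      (?=\(\d)      #   lookahead: an opening parenthesis and a digit come next
    )
    (               # group 2: the parenthesised numbers
      \( [\d,]+ \)  #   "(", digits and commas, ")"
    )
""", re.VERBOSE)

print(pattern.findall("bla kra tu (34) blaka trutra (33,45) afda"))
# [('kra tu ', '(34)'), ('blaka trutra ', '(33,45)')]
```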
The pattern will not find any (42) that does not have 2 words before it - you will have to play around a bit if that is a use case as well.
Edit:
Maybe slightly better regex: r'((?:\w+ ?){1,5}(?=\(\d))(\([\d,]+\))'
- needs only 1 word before (https://regex101.com/r/mV1l3E/5/)
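Applied to the question's own example text, both questions could be handled roughly as below. This is a sketch and a different technique from the patterns above: it uses only non-capturing groups, so findall() returns the full matches (question 1), and it uses re.finditer so every hit carries its own position, which keeps repeated hits such as a second (12,16) apart (question 2). Only the text before each hit is tokenised, with punctuation as separate tokens, so spaces inside the parentheses cannot break the match. The helper name preceding_words, its parameter n, the token rule and the shortened sample text are assumptions made for illustration.

```python
import re

# Shortened version of the example text from the question.
text = ("Lorem ipsum dolor sit amet (12,16) , consectetur 23 adipiscing elit. "
        "Curabitur (45) euismod scelerisque consectetur. "
        "Vivamus aliquam velit (46,48,49) at augue faucibus.")

# Non-capturing groups only, so findall() returns whole matches.
# Optional spaces around the commas are allowed inside the parentheses.
pattern = re.compile(r"\(\s*\d+(?:\s*,\s*\d+)*\s*\)")

def preceding_words(text, n=4):
    """Hypothetical helper: (last n tokens before the hit, full match) per hit."""
    results = []
    for m in pattern.finditer(text):
        # Tokenise only the text before this hit; punctuation is its own token.
        tokens = re.findall(r"\w+|[^\w\s]", text[:m.start()])
        results.append((tokens[-n:], m.group(0)))
    return results

print(pattern.findall(text))
# ['(12,16)', '(45)', '(46,48,49)']

for before, hit in preceding_words(text):
    print(before, "===>", hit)
```

Re-tokenising the prefix for every hit is simple but repeats work; for long texts it may be worth tokenising once and mapping token offsets to match positions instead.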
Source: https://stackoverflow.com/questions/54055163/how-to-match-regex-expression-and-get-precedent-words