I want to print out the first HTML tags that have attributes
test
test2
This seems pretty complicated. You can try the expression below, but it will fail in some cases. Its first two alternatives consume the undesired lines, and the final alternative has a capturing group for the desired ones.
It may not be the best idea to use regular expressions here.
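import re
# Alternatives 1 and 2 consume the lines we do not want (tags without
# attributes, and tags whose attributes contain "test"); alternative 3
# captures whatever remains in group 1.
regex = r"^\s*<\S+>\s*$|^\s*<\S+\s.*test.*?>.*?<\/\S+>$|^\s*(<.*>)\s*$"
test_str = """
<h1>test</h1>
<h2>test2</h2>
<div id="content"></div>
<p>test3</p>
<div class="test"></div>
<div id="nav"></div>
<p>test3</p>
"""
# With a single capturing group, findall returns group 1 for every match,
# so the consumed lines show up as empty strings.
print(re.findall(regex, test_str, re.M))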
['', '', '<div id="content"></div>', '', '', '<div id="nav"></div>', '']
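The empty strings in the output above come from the lines consumed by the first two alternatives, where the capturing group does not participate. A minimal sketch of one way to drop them, reusing the same expression and sample input (the name matches is only illustrative):

```python
import re

regex = r"^\s*<\S+>\s*$|^\s*<\S+\s.*test.*?>.*?<\/\S+>$|^\s*(<.*>)\s*$"
test_str = """
<h1>test</h1>
<h2>test2</h2>
<div id="content"></div>
<p>test3</p>
<div class="test"></div>
<div id="nav"></div>
<p>test3</p>
"""

# Keep only the matches where group 1 actually captured something.
matches = [m for m in re.findall(regex, test_str, re.M) if m]
print(matches)  # ['<div id="content"></div>', '<div id="nav"></div>']
```

If only the first such tag is needed, matches[0] would give it, assuming the list is not empty.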
The expression is explained in the top right panel of regex101.com if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs.
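Since regular expressions may not be the best fit here, a parser-based alternative could look like the sketch below. It uses html.parser from the standard library rather than the expression above. The class name AttrTagCollector and the variable names are hypothetical. Note that it collects every start tag with at least one attribute, so unlike the expression above it does not skip the class="test" tag and does not include the closing tags.

```python
from html.parser import HTMLParser


class AttrTagCollector(HTMLParser):
    """Collects the raw text of every start tag that carries attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; it is empty for bare tags.
        if attrs:
            self.tags.append(self.get_starttag_text())


sample = """
<h1>test</h1>
<h2>test2</h2>
<div id="content"></div>
<p>test3</p>
<div class="test"></div>
<div id="nav"></div>
"""

collector = AttrTagCollector()
collector.feed(sample)
collector.close()
print(collector.tags)
# ['<div id="content">', '<div class="test">', '<div id="nav">']
```

The first tag with attributes is then collector.tags[0], provided the list is not empty.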
You should use a non-greedy match for any number of characters to the left of the =, so:
r'<.*?=.*?>'
That will match a <, followed by a minimum number of characters, followed by a =, followed by the minimum number of characters until the >.
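As a quick check, the pattern can be run against a small sample. The sample string and variable names below are assumptions for illustration, with one tag per line. Because . does not match a newline by default, a tag without = on its line is skipped. If an attribute-less tag precedes a tag with attributes on the same line, the match would start at the earlier <, so this is a sketch rather than a robust solution.

```python
import re

html = (
    '<h1>test</h1>\n'
    '<h2>test2</h2>\n'
    '<div id="content"></div>\n'
    '<p>test3</p>\n'
    '<div class="test"></div>\n'
    '<div id="nav"></div>\n'
)

pattern = r'<.*?=.*?>'

# All opening tags that contain an = before their closing >.
all_tags = re.findall(pattern, html)
print(all_tags)
# ['<div id="content">', '<div class="test">', '<div id="nav">']

# Only the first such tag, or None if there is no match.
first = re.search(pattern, html)
first_tag = first.group(0) if first else None
print(first_tag)  # <div id="content">
```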
What you had:
r'<?=.*?>'
That means an optional <, followed by a =, followed by any string going up to the next >. Since the < is optional and could only match if it came directly before the =, it never ends up in the match, and the tag name is lost.