print start of html tags

执笔经年 2021-01-28 21:30

I want to print only the opening HTML tags that have attributes.

    

test

test2

2 Answers
  • 2021-01-28 21:42

    This is fairly complicated. You can try the expression below, but it will fail in some cases. Its first alternatives consume the undesired lines, and the final alternative has a capturing group for the desired ones.

    Regular expressions may not be the best tool for this job.

    Test

    import re
    
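    # Alternatives 1 and 2 consume the undesired lines (no whitespace inside the
    # tag, or "test" after the tag name); alternative 3 captures what remains.
    regex = r"^\s*<\S+>\s*$|^\s*<\S+\s.*test.*?>.*?<\/\S+>$|^\s*(<.*>)\s*$"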
    
    test_str = """
    
    <h1>test</h1>
        <h2>test2</h2>
        <div id="content"></div>
        <p>test3</p>
        <div class="test"></div>
        <div id="nav"></div>
        <p>test3</p>
    
    """
    
    print(re.findall(regex, test_str, re.M))
    

    Output

    ['', '', '<div id="content"></div>', '', '', '<div id="nav"></div>', '']
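
The empty strings in that output come from the alternatives that have no capturing group. A minimal sketch, assuming the same expression and a similar sample, of how they could be dropped and each result trimmed to its opening tag (`opening_tag` is a hypothetical helper name, not part of the original answer):

```python
import re

regex = r"^\s*<\S+>\s*$|^\s*<\S+\s.*test.*?>.*?<\/\S+>$|^\s*(<.*>)\s*$"

# Sample input modelled on the test string above (an assumption).
test_str = """
<h1>test</h1>
<h2>test2</h2>
<div id="content"></div>
<p>test3</p>
<div class="test"></div>
<div id="nav"></div>
<p>test3</p>
"""

# findall returns '' whenever an alternative without the group matched,
# so keep only the non-empty captures.
captured = [m for m in re.findall(regex, test_str, re.M) if m]


def opening_tag(element):
    """Return the text up to and including the first '>' (hypothetical helper)."""
    return re.match(r"<[^>]*>", element).group(0)


opening_tags = [opening_tag(m) for m in captured]
print(opening_tags)
```

Note that the expression deliberately skips the `class="test"` element, so it does not appear in the result.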
    

    The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it. There you can also watch how it matches against sample inputs.
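
Since a regular expression may not be the best fit here, one alternative is an HTML parser. The following is only a sketch, assuming Python's standard `html.parser` module is acceptable; the class name `AttrTagCollector`, the function `opening_tags_with_attributes`, and the sample markup are illustrative and not from the original post.

```python
from html.parser import HTMLParser


class AttrTagCollector(HTMLParser):
    """Collect the raw text of every start tag that carries attributes."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; it is empty for bare tags.
        if attrs:
            self.start_tags.append(self.get_starttag_text())


def opening_tags_with_attributes(html):
    collector = AttrTagCollector()
    collector.feed(html)
    collector.close()
    return collector.start_tags


# Sample input is an assumption modelled on the test string above.
sample = """
<h1>test</h1>
<h2>test2</h2>
<div id="content"></div>
<p>test3</p>
<div class="test"></div>
<div id="nav"></div>
"""

for tag in opening_tags_with_attributes(sample):
    print(tag)
```

Unlike the expression above, this keeps every tag that has at least one attribute, including the `class="test"` one, and it does not depend on each element sitting on its own line.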

  • 2021-01-28 22:01

    You should use a non-greedy match for any number of characters to the left of the =, so:

    r'<.*?=.*?>'
    

    That matches a <, followed by as few characters as possible, followed by an =, followed by as few characters as possible up to the next >.

    What you had:

    r'<?=.*?>'
    

    That means an optional <, followed by an =, followed by as few characters as possible up to the >. Because the optional < can only match when it sits immediately before the =, it is never part of a match for a tag like <div id="content">, and the match starts at the = instead.
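
To make the difference concrete, here is a small sketch that runs both patterns with `re.findall`. The sample strings are assumptions, and the last pattern, `<[^<>]*=[^<>]*>`, is a stricter variant added here for illustration, not part of the answer above.

```python
import re

# Assumed sample: each element sits on its own line.
sample = '<h1>test</h1>\n<div id="content"></div>\n<p>test3</p>\n<div class="test"></div>'

# The original pattern: the match starts at the "=", not at the "<".
original = re.findall(r'<?=.*?>', sample)

# The suggested pattern: the match starts at the "<" of the tag.
suggested = re.findall(r'<.*?=.*?>', sample)

print(original)
print(suggested)

# Caveat: "." also matches ">", so when a tag without attributes precedes one
# with attributes on the same line, the lazy match starts at the earlier "<".
one_line = '<p>intro</p><div id="nav"></div>'
overmatch = re.findall(r'<.*?=.*?>', one_line)

# A stricter variant (an assumption) that forbids "<" and ">" inside the tag.
stricter = re.findall(r'<[^<>]*=[^<>]*>', one_line)

print(overmatch)
print(stricter)
```

The stricter variant still breaks if an attribute value itself contains a `>`, which is one more reason a parser can be the safer choice.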
