Python regex - stripping out HTML tags and formatting characters from inner HTML

故里飘歌 2021-01-29 04:37

I'm dealing with single HTML strings like this:

>>> s = 'u><br/>\n                                    Some text <br/><br/><u'

1 answer
  • 2021-01-29 04:51

    If I understand you right, you're looking to take this input:

    u><br/>\n                                    Some text <br/><br/><u
    

    And receive this output:

    \n                                    Some text 
    

    This is done simply enough by only caring about what comes between the two inward-pointing brackets. We want:

    • A right-bracket > (so we know where to begin)
    • The content (here \n, the padding spaces, and Some text), which must not contain a left-bracket
    • A left-bracket < (so we know where to end)
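
    The three parts listed above map one-to-one onto the pieces of the pattern. A minimal sketch, assuming a shortened sample string; the name BETWEEN_TAGS is hypothetical and re.VERBOSE is used only so each piece can carry a comment:

```python
import re

# BETWEEN_TAGS is a hypothetical name chosen for illustration.
# re.VERBOSE lets the pattern span several lines with comments.
BETWEEN_TAGS = re.compile(
    r"""
    >        # a right-bracket: where the content begins
    ([^<]+)  # the content: one or more characters that are not a left-bracket
    <        # a left-bracket: where the content ends
    """,
    re.VERBOSE,
)

# Shortened sample input (assumption: fewer padding spaces than the original).
s = 'u><br/>\n    Some text <br/><br/><u'
match = BETWEEN_TAGS.search(s)
print(repr(match.group(1)))  # '\n    Some text '
```

    The first > in the sample is followed immediately by <, so the content part cannot match there and the search moves on to the > that closes the first <br/>.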

    As a regex (after import re):

    >>> s = 'u><br/>\n                                    Some text <br/><br/><u'
    >>> re.search(r'>([^<]+)<', s)
    <_sre.SRE_Match object; span=(6, 55), match='>\n                                    Some text >
    

    (The captured group can be accessed via .group(1).)
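
    One thing to watch for: re.search returns None when nothing matches, so calling .group(1) directly would raise an AttributeError on such input. A small sketch of a guarded lookup; the function name first_text_between_tags is an assumption made for this example:

```python
import re

def first_text_between_tags(s):
    """Return the first run of text between '>' and '<', or None if absent."""
    m = re.search(r'>([^<]+)<', s)
    # re.search returns None on failure, so check before calling .group().
    return m.group(1) if m is not None else None

print(repr(first_text_between_tags('u><br/>\n  Some text <br/><u')))  # '\n  Some text '
print(first_text_between_tags('<br/><br/>'))  # None
```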

    Additionally, you may want to use re.findall if you expect there to be multiple matches per line:

    >>> re.findall(r'>([^<]+)<', s)
    ['\n                                    Some text ']
    

    EDIT, to address the comment: if you have multiple matches and want to join them into a single string (effectively removing everything that looks like an HTML tag), do:

    >>> s = 'nbsp;<br><br>Some text.<br>Some \n more text.<br'
    >>> ' '.join(re.findall(r'>([^<]+)<', s))
    'Some text. Some \n more text.'
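
    The join can be wrapped in a small helper. This is only a sketch: the name join_text_between_tags and the collapse_whitespace option are assumptions for illustration, added because the question also mentions formatting characters such as \n. Note that text not enclosed by > and < (the leading nbsp; here) is dropped by this pattern.

```python
import re

def join_text_between_tags(s, collapse_whitespace=False):
    """Join every run of text found between '>' and '<' with single spaces."""
    pieces = re.findall(r'>([^<]+)<', s)
    text = ' '.join(pieces)
    if collapse_whitespace:
        # str.split() with no arguments splits on any whitespace run,
        # so newlines and repeated spaces become a single space.
        text = ' '.join(text.split())
    return text

s = 'nbsp;<br><br>Some text.<br>Some \n more text.<br'
print(repr(join_text_between_tags(s)))
# 'Some text. Some \n more text.'
print(repr(join_text_between_tags(s, collapse_whitespace=True)))
# 'Some text. Some more text.'
```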
    