How to remove tags from a string in python using regular expressions? (NOT in HTML)

终归单人心 2020-12-07 23:22

I need to remove tags from a string in Python.

<FNT name="Century Schoolbook" size="22">Title</FNT>

What is the most effici

6 Answers
  • 2020-12-08 00:06

    Please avoid using regex. Even though a regex will work on your simple string, you will run into problems later if you get a more complex one.

    You can use BeautifulSoup's get_text() method instead.

    from bs4 import BeautifulSoup
    
    text = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
    soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
    
    print(soup.get_text())
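
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not what the answer above uses, and the names TextExtractor and strip_tags are made up for illustration. It assumes the input is HTML-like enough for HTMLParser to tokenize.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside a tag.
        self.chunks.append(data)


def strip_tags(markup):
    # Hypothetical helper: feed the markup through the parser
    # and join whatever text it reported.
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


print(strip_tags('<FNT name="Century Schoolbook" size="22">Title</FNT>'))  # Title
```

Because the parser tokenizes the markup rather than pattern-matching it, nested tags need no special handling.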
    
  • 2020-12-08 00:07

    Searching for this regex and replacing every match with an empty string should work.

    /<[A-Za-z\/][^>]*>/
    

    Example (from python shell):

    >>> import re
    >>> my_string = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
    >>> print(re.sub(r'<[A-Za-z\/][^>]*>', '', my_string))
    Title
    
  • This should work:

    import re
    mystring = '<FNT name="Century Schoolbook" size="22">Title</FNT>'
    print(re.sub(r'<[^>]*>', '', mystring))  # Title
    

    To everyone saying that regexes are not the correct tool for the job:

    The context of the problem is such that all the objections regarding regular/context-free languages are invalid. His language essentially consists of three entities: a = <, b = >, and c = [^><]+. He wants to remove any occurrences of acb. This fairly directly characterizes his problem as one involving a context-free grammar, and it is not much harder to characterize it as a regular one.

    I know everyone likes the "you can't parse HTML with regular expressions" answer, but the OP doesn't want to parse it, he just wants to perform a simple transformation.
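
A minimal sketch of that transformation, with the assumption made explicit: the pattern only behaves as intended while nothing between < and > contains another > character. The names TAG_RE and remove_tags, and the second input string, are illustrative and not from the question.

```python
import re

# Matches "<", then any run of characters other than ">", then ">".
TAG_RE = re.compile(r'<[^>]*>')


def remove_tags(text):
    # Hypothetical helper: delete every match of TAG_RE.
    return TAG_RE.sub('', text)


print(remove_tags('<FNT name="Century Schoolbook" size="22">Title</FNT>'))  # Title

# Outside the stated assumption: a ">" inside an attribute value
# ends the match early and leaves debris behind.
print(remove_tags('<a title="x > y">link</a>'))  #  y">link
```

For input like the question's, where attribute values never contain angle brackets, the first call is all that is needed.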

  • 2020-12-08 00:14

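    If you only need to parse the markup and retrieve values, you might take a look at BeautifulStoneSoup (the XML parser class from BeautifulSoup 3; in bs4 the equivalent is to pass "xml" as the parser argument to BeautifulSoup).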

  • 2020-12-08 00:21

    If the source text is well-formed XML, you can use the stdlib module ElementTree:

    import xml.etree.ElementTree as ET
    mystring = """<FNT name="Century Schoolbook" size="22">Title</FNT>"""
    element = ET.XML(mystring)
    print(element.text)  # Title
    

    If the source isn't well-formed, BeautifulSoup is a good suggestion. Using regular expressions to parse tags is not a good idea, as several posters have pointed out.
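
One caveat worth sketching: element.text only holds the text before the first child element. If the tags can be nested, itertext() walks the whole subtree. The nested input below is a made-up example, not from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical input with a nested tag, to show the difference.
nested = '<FNT name="Century Schoolbook" size="22">Big <B>bold</B> title</FNT>'
element = ET.XML(nested)

# .text stops at the first child element.
print(repr(element.text))  # 'Big '

# itertext() yields every text fragment in document order.
print("".join(element.itertext()))  # Big bold title
```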

  • 2020-12-08 00:23

    Use an XML parser, such as ElementTree. Regular expressions are not the right tool for this job.
