How to detect with Python whether a string contains HTML code?

没有蜡笔的小新 2020-12-29 21:44

How can I detect whether a string contains HTML (it can be HTML4, HTML5, or just fragments of HTML within text)? I do not need to know the HTML version, but rather if the string is just

4 answers
  • 2020-12-29 22:24

    One way I thought of is to parse the text as HTML, collect the start and end tags that are found, and intersect that set with a known set of acceptable HTML elements.

    Example:

    #!/usr/bin/env python

    from __future__ import print_function

    try:
        from html.parser import HTMLParser  # Python 3
    except ImportError:
        from HTMLParser import HTMLParser  # Python 2

    # NOTE: needs an older html5lib release; newer ones dropped html5lib.sanitizer.
    from html5lib.sanitizer import HTMLSanitizerMixin


    class TestHTMLParser(HTMLParser):
        """Collects the name of every start and end tag it encounters."""

        def __init__(self, *args, **kwargs):
            HTMLParser.__init__(self, *args, **kwargs)

            self.elements = set()

        def handle_starttag(self, tag, attrs):
            self.elements.add(tag)

        def handle_endtag(self, tag):
            self.elements.add(tag)


    def is_html(text):
        # Tag names that html5lib's sanitizer treats as acceptable HTML.
        elements = set(HTMLSanitizerMixin.acceptable_elements)

        parser = TestHTMLParser()
        parser.feed(text)

        # True if at least one parsed tag is a known HTML element.
        return bool(parser.elements.intersection(elements))


    print(is_html("foo bar"))
    print(is_html("<p>Hello World!</p>"))
    print(is_html("<html><head><title>Title</title></head><body><p>Hello!</p></body></html>"))  # noqa
    

    Output:

    $ python foo.py
    False
    True
    True
    

    This also works for partial text that contains only a few HTML elements.

    NB: This relies on html5lib's list of acceptable elements, so it may not work for other document types, but the technique is easy to adapt.
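
    For anyone who cannot install that html5lib version, here is a minimal sketch of the same intersection idea using only the standard library on Python 3. The names `KNOWN_TAGS`, `TagCollector` and `contains_known_html` are made up for this illustration, and the tag list is a small, deliberately incomplete assumption, not html5lib's actual list.

```python
from html.parser import HTMLParser

# Hypothetical, deliberately incomplete allowlist (an assumption for this
# sketch, NOT html5lib's list); extend it to suit your own data.
KNOWN_TAGS = frozenset({
    "a", "b", "body", "br", "div", "em", "h1", "h2", "h3", "head", "html",
    "i", "img", "li", "ol", "p", "span", "strong", "table", "td", "title",
    "tr", "ul",
})


class TagCollector(HTMLParser):
    """Collect the names of all start and end tags seen while parsing."""

    def __init__(self):
        super().__init__()
        self.elements = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        self.elements.add(tag)

    def handle_endtag(self, tag):
        self.elements.add(tag)


def contains_known_html(text, known_tags=KNOWN_TAGS):
    """Return True if at least one parsed tag name is in known_tags."""
    parser = TagCollector()
    parser.feed(text)
    parser.close()
    return bool(parser.elements & known_tags)


if __name__ == "__main__":
    print(contains_known_html("foo bar"))
    print(contains_known_html("Hello, <b>world</b>"))
    print(contains_known_html("<foo>bar</foo>"))
```

    Because the result depends on the allowlist, an unknown tag such as `<foo>` is not counted as HTML, which is the point of intersecting with a known set.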

  • 2020-12-29 22:34

    Expanding on the previous post, I would do something like this for a quick and simple check:

    import os

    if os.path.exists("file.html"):
        ishtml = False
        # The with block closes the file even if reading fails.
        with open("file.html", mode="r", encoding="utf-8") as checkfile:
            for line in checkfile:
                # Look for a line consisting solely of the closing tag.
                if line.strip() == "</html>":
                    ishtml = True
                    break
        if ishtml:
            print("This is an html file")
        else:
            print("This is not an html file")
    
  • 2020-12-29 22:41

    Check for the closing tag. This is the simplest and most robust approach, I believe.

    "</html>" in possibly_html
    

    If there is a closing html tag, the string looks like HTML; otherwise, probably not.
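
    As a small sketch of that check wrapped in a function (the name `has_closing_html_tag` is made up here), lowercasing the text first also catches `</HTML>`. Keep in mind that this only recognizes complete documents: a fragment such as `Hello, <b>world</b>` has no closing html tag and is reported as not HTML.

```python
def has_closing_html_tag(possibly_html):
    """Return True if the text contains a closing </html> tag in any letter case.

    This is a plain substring test, so variants with inner whitespace
    (for example "</html >") are not matched.
    """
    return "</html>" in possibly_html.lower()


if __name__ == "__main__":
    print(has_closing_html_tag("<HTML><body>Hi</body></HTML>"))
    print(has_closing_html_tag("Hello, <b>world</b>"))
```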

  • 2020-12-29 22:43

    You can use an HTML parser, like BeautifulSoup. Note that it tries its best to parse even broken HTML, and how lenient it is depends on the underlying parser:

    >>> from bs4 import BeautifulSoup
    >>> html = """<html>
    ... <head><title>I'm title</title></head>
    ... </html>"""
    >>> non_html = "This is not an html"
    >>> bool(BeautifulSoup(html, "html.parser").find())
    True
    >>> bool(BeautifulSoup(non_html, "html.parser").find())
    False
    

    This basically tries to find any HTML element inside the string. If one is found, the result is True.

    Another example with an HTML fragment:

    >>> html = "Hello, <b>world</b>"
    >>> bool(BeautifulSoup(html, "html.parser").find())
    True
    

    Alternatively, you can use lxml.html:

    >>> import lxml.html
    >>> html = 'Hello, <b>world</b>'
    >>> non_html = "<ht fldf d><"
    >>> lxml.html.fromstring(html).find('.//*') is not None
    True
    >>> lxml.html.fromstring(non_html).find('.//*') is not None
    False
    