Clean Up HTML in Python

有刺的猬 2020-12-08 16:22

I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or malformed tags. Is there a way to clean this up in Python?

5 Answers
  • 2020-12-08 17:02

    I would suggest BeautifulSoup. It has a wonderful parser that can deal with malformed tags quite gracefully. Once you've read in the entire tree, you can just output the result.

    from bs4 import BeautifulSoup
    # bad_html holds your malformed markup; naming the parser
    # explicitly avoids bs4's "no parser specified" warning
    tree = BeautifulSoup(bad_html, "html.parser")
    good_html = tree.prettify()
    

    I've used this many times and it works wonders. BeautifulSoup also really shines if all you need is to pull data out of bad HTML, rather than re-emit clean markup.
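    For the data-extraction case, here is a minimal sketch (the `bad_html` fragment is made up for illustration) showing that BeautifulSoup builds a queryable tree even when every closing tag is missing:

```python
from bs4 import BeautifulSoup

# Every </li> and the </ul> are missing, yet BeautifulSoup
# still builds a tree we can search.
bad_html = "<ul><li>alpha<li>beta<li>gamma"
soup = BeautifulSoup(bad_html, "html.parser")

items = soup.find_all("li")
print(len(items))        # all three list items are recovered
print(soup.get_text())   # the text content survives intact
```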

  • 2020-12-08 17:03

    There are python bindings for the HTML Tidy Library Project, but automatically cleaning up broken HTML is a tough nut to crack. It's not so different from trying to automatically fix source code -- there are just too many possibilities. You'll still need to review the output and almost certainly make further fixes by hand.

  • 2020-12-08 17:07

    This can be done using the tidy_document function in the tidylib module.

    import tidylib

    html = '<html>...</html>'
    options = {
        "output-xhtml": True,      # or "output-xml": True
        "quiet": True,
        "show-errors": 0,
        "force-output": True,
        "numeric-entities": True,
        "show-warnings": False,
        "input-encoding": "utf8",
        "output-encoding": "utf8",
        "indent": False,
        "tidy-mark": False,
        "wrap": 0,
    }
    document, errors = tidylib.tidy_document(html, options=options)
    
  • 2020-12-08 17:14

    Here is an example of cleaning up HTML using the lxml.html.clean.Cleaner class:

    import sys
    
    from lxml.html.clean import Cleaner
    
    
    def sanitize(dirty_html):
        cleaner = Cleaner(page_structure=True,
                          meta=True,
                          embedded=True,
                          links=True,
                          style=True,
                          processing_instructions=True,
                          inline_style=True,
                          scripts=True,
                          javascript=True,
                          comments=True,
                          frames=True,
                          forms=True,
                          annoying_tags=True,
                          remove_unknown_tags=True,
                          safe_attrs_only=True,
                          safe_attrs=frozenset(['src', 'color', 'href', 'title', 'class', 'name', 'id']),
                          remove_tags=('span', 'font', 'div'))
    
        return cleaner.clean_html(dirty_html)
    
    
    if __name__ == '__main__':
        with open(sys.argv[1]) as fin:
            print(sanitize(fin.read()))
    

    Check out the docs for a full list of options you can pass to the Cleaner.

  • 2020-12-08 17:17

    I am using lxml to convert HTML to proper (well-formed) XML:

    from lxml import etree

    tree = etree.HTML(input_text.replace('\r', ''))
    # encoding="unicode" makes tostring() return str instead of bytes,
    # so the pieces can be joined into a single string
    output_text = '\n'.join(etree.tostring(subtree, pretty_print=True,
                                           method="xml", encoding="unicode")
                            for subtree in tree)
    

    ... along with a lot of removal of 'dangerous elements' in between....
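    That removal step can be sketched with lxml's strip_elements (the broken input fragment here is made up for illustration):

```python
from lxml import etree

# Parse a broken fragment, then drop <script> elements
# (both the tags and their content) before serializing.
broken = "<p>text<script>alert(1)</script><p>more text"
tree = etree.HTML(broken)
etree.strip_elements(tree, "script")
cleaned = etree.tostring(tree, pretty_print=True, encoding="unicode")
print(cleaned)   # well-formed output with the scripts gone
```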
