Can't figure out how to invoke html5Tidy from Python 3

Submitted by 坚强是说给别人听的谎言 on 2020-01-06 06:49:47

Question


For Python 3.5.

Can someone please point me to some documentation for using html5tidy with Python 3? I'm amazed that multiple searches don't return anything.

In Python 3, the documentation in html5tidy.py states:

"""
HTML5Tidy
=========

Simple wrapper around html5lib & lxml.etree to "tidy" html in the wild to
well-formed xml/html

Usage
-----

    >>> from html5tidy import tidy
    >>> tidy('some text')
    '<html><head/><body>some text</body></html>'

Dependencies
------------

* [html5lib](http://code.google.com/p/html5lib/)
* [lxml](http://lxml.de/)

Okay, so I have all the pieces:

>>> import html5lib
>>> dir(html5lib)
['HTMLParser', '__all__', '__builtins__', '__cached__', [and so on]]
>>> 
>>> import lxml
>>> dir(lxml)
['__builtins__', '__cached__', '__doc__', '__file__', [and so on]]

BUT I note that dir(tidy) returns only double-underscore results:

>>> from html5tidy import tidy
>>> dir(tidy)
['__annotations__', '__call__', '__class__', [and so on...]'__subclasshook__']

So I open a file containing HTML as untidiedHTML.

>>> print(untidiedHTML)
<!DOCTYPE html>
<html id="ng-app" lang="en" ng-app="TH" style="" xmlns:ng="http://angularjs.org">
 <head ng-controller="DZHeadController">
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <title ng-bind="service.title">
   What the Heck Is OAuth? - DZone Security
  </title>
  <link href="WhatIsOAuth0200_files/tranquility.css" rel="stylesheet" type="text/css"/>
 </head>
 <body class="tranquility" >
 ... and so on...

Then per the HTML5 tidy documentation I try:

from html5tidy import tidy
tidiedHTML = tidy(untidiedHTML)

That produces:

Traceback (most recent call last):
  File "[path to my Python source file].py", line 50, in <module>
    tidiedHTML = tidy(untidiedHTML)
  File "/usr/local/lib/python3.5/dist-packages/html5tidy.py", line 61, in tidy
    parts = [parser.parse(src, encoding=encoding, parseMeta=parseMeta, useChardet=useChardet)]
  File "/usr/local/lib/python3.5/dist-packages/html5lib/html5parser.py", line 289, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/html5parser.py", line 130, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/_inputstream.py", line 149, in HTMLInputStream
    return HTMLUnicodeInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'parseMeta'

I have NO idea what to do. I've searched for documentation that explains how to invoke html5tidy from Python 3 but I've come up empty...


Answer 1:


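That library is broken and/or doesn't work with Python 3.5. I installed it and ran into errors related to html5lib.HTMLParser: https://github.com/aleray/html5tidy/blob/master/html5tidy.py#L57. Your traceback points the same way: html5tidy passes keyword arguments such as parseMeta that the installed html5lib rejects.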

There's one contributor, and the package has not been updated in 6 years.

Your options are:

  • fork the repo, fix the issues, and submit a pull request
  • extract the code you need and roll your own
  • find another library
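
For the "roll your own" option, here is a minimal sketch that calls html5lib and lxml directly instead of going through html5tidy. It assumes html5lib 1.x and lxml are installed. The name tidy_html is made up for illustration and is not part of html5tidy, and the output may not match html5tidy's exactly.

```python
import html5lib
from lxml import etree


def tidy_html(src):
    """Parse possibly malformed HTML and return it as well-formed XML text.

    Hypothetical helper for illustration; not html5tidy's API.
    """
    # treebuilder="lxml" makes html5lib build an lxml tree;
    # namespaceHTMLElements=False keeps plain tag names such as "body".
    tree = html5lib.parse(src, treebuilder="lxml", namespaceHTMLElements=False)
    # encoding="unicode" makes lxml return a str rather than bytes.
    return etree.tostring(tree, encoding="unicode")


if __name__ == "__main__":
    # Unclosed <p> and <b> tags are closed by the HTML5 parsing rules.
    print(tidy_html("<p>some <b>text"))
```

No encoding-related keyword arguments are passed here, so the parseMeta problem from the traceback doesn't come up. If you feed it bytes instead of str, you may need html5lib's encoding options, which are not covered in this sketch.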


Source: https://stackoverflow.com/questions/50380799/cant-figure-out-how-to-invoke-html5tidy-from-python-3
