Question
For Python 3.5.
Can someone please point me to some documentation for using html5tidy with Python 3? I'm amazed that multiple searches don't return anything.
In Python 3, the documentation in html5tidy.py states:
"""
HTML5Tidy
=========
Simple wrapper around html5lib & lxml.etree to "tidy" html in the wild to
well-formed xml/html
Usage
-----
>>> from html5tidy import tidy
>>> tidy('some text')
'<html><head/><body>some text</body></html>'
Dependencies
------------
* [html5lib](http://code.google.com/p/html5lib/)
* [lxml](http://lxml.de/)
"""
Okay, so I have all the pieces:
>>> import html5lib
>>> dir(html5lib)
['HTMLParser', '__all__', '__builtins__', '__cached__', [and so on]]
>>>
>>> import lxml
>>> dir(lxml)
['__builtins__', '__cached__', '__doc__', '__file__', [and so on]]
BUT I note that dir(tidy) returns only double-underscore results:
>>> from html5tidy import tidy
>>> dir(tidy)
['__annotations__', '__call__', '__class__', [and so on...]'__subclasshook__']
So I read a file containing HTML into the variable untidiedHTML:
>>> print(untidiedHTML)
<!DOCTYPE html>
<html id="ng-app" lang="en" ng-app="TH" style="" xmlns:ng="http://angularjs.org">
<head ng-controller="DZHeadController">
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<title ng-bind="service.title">
What the Heck Is OAuth? - DZone Security
</title>
<link href="WhatIsOAuth0200_files/tranquility.css" rel="stylesheet" type="text/css"/>
</head>
<body class="tranquility" >
... and so on...
Then, per the html5tidy documentation, I try:
from html5tidy import tidy
tidiedHTML = tidy(untidiedHTML)
That produces:
Traceback (most recent call last):
  File "[path to my Python source file].py", line 50, in <module>
    tidiedHTML = tidy(untidiedHTML)
  File "/usr/local/lib/python3.5/dist-packages/html5tidy.py", line 61, in tidy
    parts = [parser.parse(src, encoding=encoding, parseMeta=parseMeta, useChardet=useChardet)]
  File "/usr/local/lib/python3.5/dist-packages/html5lib/html5parser.py", line 289, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/html5parser.py", line 130, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/html5lib/_inputstream.py", line 149, in HTMLInputStream
    return HTMLUnicodeInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'parseMeta'
I have NO idea what to do. I've searched for documentation that explains how to invoke html5tidy from Python 3 but I've come up empty...
Answer 1:
That library is broken and/or doesn't work with Python 3.5. I installed it and ran into errors related to html5lib.HTMLParser; see https://github.com/aleray/html5tidy/blob/master/html5tidy.py#L57
There's only one contributor, and the package has not been updated in six years.
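The traceback in the question hints at the mechanism: html5tidy passes parseMeta (along with encoding and useChardet) to parser.parse(), html5lib forwards those keywords down the call chain, and HTMLUnicodeInputStream rejects them. Presumably the installed html5lib is newer than the one html5tidy was written against. The sketch below reproduces that failure pattern with standard-library code only. The class and function names are hypothetical stand-ins, not html5lib's real implementation.

```python
class UnicodeInputStream:
    """Hypothetical stand-in for a stream class that accepts only `source`."""

    def __init__(self, source):
        self.source = source


def open_stream(source, **kwargs):
    # Stand-in for a factory that blindly forwards every keyword it receives.
    return UnicodeInputStream(source, **kwargs)


def try_open(source, **kwargs):
    """Return "ok" on success, or the TypeError message on failure."""
    try:
        open_stream(source, **kwargs)
        return "ok"
    except TypeError as exc:
        return str(exc)


# No extra keywords: the constructor is satisfied.
print(try_open("some text"))

# A keyword the constructor no longer knows about raises TypeError,
# just as in the traceback from the question.
print(try_open("some text", parseMeta=True))
```

The caller (here, html5tidy) has to stop sending the stale keyword; nothing in your own script can fix this short of patching the library.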
Your options are:
- fork the repo, fix the issues, and submit a pull request
- extract the code you need and roll your own
- find another library
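For the roll-your-own route, a minimal sketch might call html5lib and lxml directly, since the html5tidy docstring describes itself as a simple wrapper around those two. This is not html5tidy's code: tidy_html is a hypothetical name, and I have only reasoned about it against the 'some text' input from the docstring, not the page in the question.

```python
import html5lib
from lxml import etree


def tidy_html(src):
    """Parse possibly malformed HTML and return well-formed markup as a str.

    Hypothetical helper; not part of html5tidy.
    """
    # Build an lxml tree. namespaceHTMLElements=False keeps element names
    # free of the XHTML namespace.
    document = html5lib.parse(src, treebuilder="lxml", namespaceHTMLElements=False)
    # encoding="unicode" makes lxml return a str instead of bytes.
    return etree.tostring(document, encoding="unicode")


print(tidy_html("some text"))
```

Because no encoding, parseMeta, or useChardet keywords are passed, this avoids the call that fails in the traceback. It assumes src is already a decoded str, as untidiedHTML is in the question.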
Source: https://stackoverflow.com/questions/50380799/cant-figure-out-how-to-invoke-html5tidy-from-python-3