Is it possible to avoid adding a namespace to the tag when using html5parser from the lxml.html package?
Example:
from lxml import html
print(html.parse('http://example.com').getroot().tag)
# You will get 'html'
from lxml.html import html5parser
print(html5parser.parse('http://example.com').getroot().tag)
# You will get '{http://www.w3.org/1999/xhtml}html'
The easiest solution I found is to remove it with a regex, but is it possible not to include the namespace in the first place?
html5lib's HTMLParser accepts a boolean namespaceHTMLElements flag that controls this behavior:
from lxml.html import html5parser
from html5lib import HTMLParser
root = html5parser.parse('http://example.com',
                         parser=HTMLParser(namespaceHTMLElements=False))
print(root.tag)  # prints "html"
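If the tree has already been parsed with namespaces, the tags can also be cleaned up afterwards without a regex. The following is only a minimal sketch using the standard library's xml.etree.ElementTree: the strip_namespaces helper and the XHTML_NS constant are hypothetical names made up for illustration, and the small inline document stands in for whatever a namespace-aware parser returned.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"  # namespace seen in the question


def strip_namespaces(root):
    """Rewrite '{uri}name' tags to 'name' in place and return root."""
    for el in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
    return root


# Stand-in for a tree produced by a namespace-aware parser.
root = ET.fromstring(
    '<html xmlns="%s"><body><p>Hi</p></body></html>' % XHTML_NS
)
print(root.tag)                    # {http://www.w3.org/1999/xhtml}html
print(strip_namespaces(root).tag)  # html
```

This only touches element tags; namespaced attribute names, if any, are left as they are. Passing namespaceHTMLElements=False at parse time avoids the extra pass entirely.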
Source: https://stackoverflow.com/questions/35012693/how-to-remove-namespace-value-from-inside-lxml-html-html5paser-element-tag