How to remove namespace value from inside lxml.html.html5parser element tag

狂风中的少年 submitted on 2019-12-08 04:32:09

Question


Is it possible to avoid adding a namespace to the tag when using html5parser from the lxml.html package?

Example:

from lxml import html
print(html.parse('http://example.com').getroot().tag)
# You will get 'html'

from lxml.html import html5parser
print(html5parser.parse('http://example.com').getroot().tag)
# You will get '{http://www.w3.org/1999/xhtml}html'

The easiest solution I found is to strip the namespace with a regex, but is it possible to avoid including it in the first place?
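
For reference, the regex workaround mentioned above might look roughly like the sketch below. The helper name strip_namespace and the pattern are illustrative assumptions, not part of lxml; the code only removes a leading '{namespace}' prefix from a tag string.

```python
import re

# Matches a leading '{namespace}' prefix such as '{http://www.w3.org/1999/xhtml}'.
_NAMESPACE_PREFIX = re.compile(r'^\{[^}]*\}')


def strip_namespace(tag):
    """Return the tag name without its '{namespace}' prefix (hypothetical helper)."""
    return _NAMESPACE_PREFIX.sub('', tag)


print(strip_namespace('{http://www.w3.org/1999/xhtml}html'))  # prints "html"
print(strip_namespace('html'))                                # prints "html"
```

This has to be applied to every element whose tag is inspected, which is why a parser-level option would be preferable.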


Answer 1:


html5lib's HTMLParser accepts a boolean namespaceHTMLElements flag that controls this behavior. Pass a parser created with the flag set to False:

from lxml.html import html5parser
from html5lib import HTMLParser

# namespaceHTMLElements=False keeps element tags free of the XHTML namespace
parser = HTMLParser(namespaceHTMLElements=False)
root = html5parser.parse('http://example.com', parser=parser)
print(root.tag)  # prints "html"
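
Note that the parser above comes from html5lib itself, and as far as I can tell its default tree builder produces standard-library xml.etree elements rather than lxml ones. If lxml elements are needed, a variant that appears to work is passing the same flag to lxml's own HTMLParser subclass in lxml.html.html5parser, which forwards extra keyword arguments to html5lib. A minimal sketch, assuming lxml and html5lib are installed; the inline markup string is only a stand-in so that no network access is needed:

```python
from lxml.html import html5parser

# lxml's HTMLParser subclass builds lxml trees; extra keyword arguments
# are passed through to html5lib's HTMLParser.
parser = html5parser.HTMLParser(namespaceHTMLElements=False)

# Stand-in markup for illustration; html5parser.parse() takes the same
# parser argument when reading from a URL or file.
markup = '<!DOCTYPE html><html><head><title>t</title></head><body><p>hi</p></body></html>'
root = html5parser.document_fromstring(markup, parser=parser)

print(root.tag)               # root element tag, without the XHTML namespace
print(root.find('body').tag)  # child element tags are not namespaced either
```

With this variant, html5parser.parse() returns a tree rather than an element, so .getroot() is still needed there, as in the question.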


Source: https://stackoverflow.com/questions/35012693/how-to-remove-namespace-value-from-inside-lxml-html-html5paser-element-tag
