Python lxml library
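html = etree.HTML(text): the argument can be either str or bytes, and the return value is an etree._Element.
etree.parse('hello.html'): the argument is a file path, and the return value is an etree._ElementTree.
etree.tostring(html, encoding='unicode'): without the encoding argument it returns bytes; with it, it returns str.

After reading the file with etree.parse(), XPath queries do not succeed. The root element is <html xmlns="http://www.w3.org/1999/xhtml">, and removing the xmlns attribute makes them work. etree.parse() uses the XML parser by default, which places every element in that namespace, so unprefixed element names in an XPath expression match nothing.

Opening the file in binary mode and passing the bytes to etree.HTML works, and XPath then succeeds.
……
Opening the file in text mode and passing the resulting str to etree.HTML does not work:

Traceback (most recent call last):
  File "d:\我的文档\py\test\tieba\qu.py", line 53, in <module>
    html = etree.HTML(html2)
  File "src\lxml\etree.pyx", line 3178, in lxml.etree.HTML (src\lxml\etree.c:80497)
  File "src\lxml\parser.pxi", line 1866, in lxml.etree.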
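
The behaviour noted above can be reproduced with a small sketch. The sample markup, the temporary hello.html file, and the namespace prefix "x" are assumptions made for illustration; they are not the author's actual file or script. The sketch shows the return types, the effect of the xmlns attribute on XPath, and two possible ways around it without editing the file.

```python
import io
import tempfile
from pathlib import Path

from lxml import etree

# Hypothetical sample document standing in for hello.html.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<body><p>hello</p></body></html>"
)
XHTML_BYTES = XHTML.encode("utf-8")

# etree.HTML accepts str or bytes and returns an _Element.
root_from_str = etree.HTML(XHTML)
root_from_bytes = etree.HTML(XHTML_BYTES)

# tostring returns bytes by default and str with encoding="unicode".
as_bytes = etree.tostring(root_from_str)
as_str = etree.tostring(root_from_str, encoding="unicode")

# etree.parse returns an _ElementTree and uses the XML parser by default.
# The XML parser honours xmlns, so an unprefixed path finds nothing.
tree = etree.parse(io.BytesIO(XHTML_BYTES))
plain_hits = tree.xpath("//p")

# Option 1: keep the XML parser and bind a prefix to the namespace.
ns = {"x": "http://www.w3.org/1999/xhtml"}  # "x" is an arbitrary prefix
ns_hits = tree.xpath("//x:p", namespaces=ns)

# Option 2: parse with the HTML parser, which ignores the namespace.
html_tree = etree.parse(io.BytesIO(XHTML_BYTES), etree.HTMLParser())
html_hits = html_tree.xpath("//p")

# Reading the file in binary mode and handing the bytes to etree.HTML,
# as the notes describe.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "hello.html"
    path.write_bytes(XHTML_BYTES)
    with open(path, "rb") as f:
        root_from_file = etree.HTML(f.read())
file_hits = root_from_file.xpath("//p")
```

If the text-mode failure comes from an XML encoding declaration at the top of the file (an assumption, since the traceback is cut off before the error message), passing bytes as in the last step sidesteps it, because lxml then handles the decoding itself.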