Urllib combined together with elementtree

Posted by 折月煮酒 on 2020-01-16 03:28:11

Question


I'm having a few problems parsing simple HTML with the ElementTree module from the Python standard library. This is my source code:

from urllib.request import urlopen
from xml.etree.ElementTree import ElementTree

import sys

def main():
    site = urlopen("http://1gabba.in/genre/hardstyle")
    try:
        html = site.read().decode('utf-8')
        xml = ElementTree(html)
        print(xml)
        print(xml.findall("a"))        
    except:
        print(sys.exc_info())

if __name__ == '__main__':
    main()

This fails, and I get the following output on my console:

<xml.etree.ElementTree.ElementTree object at 0x00000000027D14E0>
(<class 'AttributeError'>, AttributeError("'str' object has no attribute 'findall'",), <traceback object at 0x0000000002910B88>)

So xml is indeed an ElementTree object, and according to the documentation the ElementTree class has a findall method. One more detail: xml.find("a") runs without error, but it returns an int instead of an Element instance.

Could anybody help me out? What am I misunderstanding?


Answer 1:


Replace ElementTree(html) with ElementTree.fromstring(html), and change your import statement to say from xml.etree import ElementTree.

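The problem here is that the ElementTree constructor doesn't expect a string as its input -- it expects an Element object. Because the string is stored as the tree's root, xml.findall("a") is forwarded to the str, which has no findall method, while xml.find("a") ends up calling str.find and returns an int. The function xml.etree.ElementTree.fromstring() is the easiest way to parse a string; it returns the root Element, which supports find and findall directly.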
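
As a minimal sketch of that fix, assuming the input is well-formed XML: the markup string below is made up and stands in for the downloaded page, which may well not parse this cleanly. It also shows that a bare "a" only matches direct children, so ".//a" is usually what you want for links.

```python
from xml.etree import ElementTree

# Made-up, well-formed snippet standing in for the downloaded page (assumption:
# the real page is probably not well-formed XML and would raise ParseError).
markup = (
    "<html><body>"
    "<a href='/one'>one</a>"
    "<p><a href='/two'>two</a></p>"
    "</body></html>"
)

root = ElementTree.fromstring(markup)   # returns the root Element
tree = ElementTree.ElementTree(root)    # optional: wrap the Element in an ElementTree

# "a" matches only direct children of the root; ".//a" searches all descendants.
direct_links = root.findall("a")
all_links = root.findall(".//a")
hrefs = [a.get("href") for a in all_links]
print(hrefs)
```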

I'm guessing that an XML parser isn't what you really want for this task, given that you're parsing HTML (which is not necessarily valid XML). You might want to take a look at:

  • http://www.boddie.org.uk/python/HTML.html
  • Parsing HTML in Python
  • http://www.crummy.com/software/BeautifulSoup/
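
If you would rather stay inside the standard library, html.parser.HTMLParser tolerates markup that an XML parser rejects. This is only a rough sketch of that alternative, not what any of the linked libraries do internally; the LinkCollector class name and the sample string are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags (LinkCollector is a hypothetical name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lower-cased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Deliberately sloppy HTML: unclosed <p> and <br>, which an XML parser would reject.
sample = '<html><body><p>First<br><a href="/a">A</a><p>Second <a href="/b">B</a></body></html>'

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```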



Answer 2:


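The file argument expects a file name or a file object rather than the decoded string, so pass the response returned by urlopen itself (before calling read() on it):

xml = ElementTree(file=site)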

P.S.: The above works only when the document is well-formed XML. If the XML structure contains an error, or the input is malformed HTML, it raises ParseError.
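
To illustrate the ParseError case, here is a small sketch that assumes the file argument is given a file object. io.BytesIO stands in for the urlopen response, and load_tree is a hypothetical helper name, not part of ElementTree.

```python
import io
from xml.etree.ElementTree import ElementTree, ParseError


def load_tree(data):
    """Return an ElementTree for the bytes in data, or None if they are not
    well-formed XML (load_tree is a hypothetical helper)."""
    try:
        # io.BytesIO stands in for the file-like response from urlopen.
        return ElementTree(file=io.BytesIO(data))
    except ParseError as exc:
        print("not well-formed:", exc)
        return None


good = load_tree(b"<ul><li>one</li><li>two</li></ul>")
bad = load_tree(b"<ul><li>one<li>two</ul>")  # unclosed <li> tags, as in sloppy HTML
```

Returning None is just one way to handle the failure; for real pages, falling back to an HTML parser is probably more useful.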

You might like to use BeautifulSoup for HTML parsing. If you want to use XPath and lxml, you might also like html5lib.

It is as easy as:

import html5lib

tree = html5lib.parse(html, treebuilder='lxml', namespaceHTMLElements=False)
# tree is an lxml tree (built even from malformed HTML) that supports find and findall with XPath expressions


Source: https://stackoverflow.com/questions/9672448/urllib-combined-together-with-elementtree
