Question
I'm having a few problems parsing simple HTML with the ElementTree module from the Python standard library. This is my source code:
from urllib.request import urlopen
from xml.etree.ElementTree import ElementTree
import sys

def main():
    site = urlopen("http://1gabba.in/genre/hardstyle")
    try:
        html = site.read().decode('utf-8')
        xml = ElementTree(html)
        print(xml)
        print(xml.findall("a"))
    except:
        print(sys.exc_info())

if __name__ == '__main__':
    main()
This fails, and I get the following output on my console:
<xml.etree.ElementTree.ElementTree object at 0x00000000027D14E0>
(<class 'AttributeError'>, AttributeError("'str' object has no attribute 'findall'",), <traceback object at 0x0000000002910B88>)
So xml is indeed an ElementTree object, and according to the documentation the ElementTree class has a findall method. One more oddity: xml.find("a") works, but it returns an int instead of an Element instance.
Could anybody help me out? What am I misunderstanding?
Answer 1:
Replace ElementTree(html) with ElementTree.fromstring(html), and change your import statement to say from xml.etree import ElementTree.
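The problem here is that the ElementTree constructor doesn't expect a string as its input; it expects an Element object. When it is given a string, it keeps that string as the tree's root, so find and findall are looked up on a str. That is why xml.find("a") returns an int (it is really str.find) and xml.findall("a") raises AttributeError. The function xml.etree.ElementTree.fromstring() is the easiest way to parse a string; it returns the root Element.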
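A minimal sketch of the fix described above, runnable without network access. The markup string is a made-up, well-formed stand-in for the downloaded page, and the variable names are illustrative only. It also shows a detail that is easy to miss: a bare tag name such as "a" matches only direct children, so a path like ".//a" is needed to search the whole tree.

```python
from xml.etree import ElementTree as ET

# Hypothetical, well-formed markup standing in for the downloaded page.
markup = (
    "<html><body>"
    "<a href='/one'>one</a>"
    "<p><a href='/two'>two</a></p>"
    "</body></html>"
)

# fromstring() parses the text and returns the root Element (<html> here).
root = ET.fromstring(markup)

# A bare tag name only matches direct children of the element searched;
# <html> has no direct <a> children, so this list is empty.
direct = root.findall("a")

# ".//a" searches all descendants, in document order.
links = [a.get("href") for a in root.findall(".//a")]

# If an ElementTree wrapper is needed, build it from the parsed Element.
tree = ET.ElementTree(root)

print(direct, links)
```

This only works because the sample markup is well-formed; real-world HTML often is not.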
I'm guessing that an XML parser isn't what you really want for this task, given that you're parsing HTML (which is not necessarily valid XML). You might want to take a look at:
- http://www.boddie.org.uk/python/HTML.html
- Parsing HTML in Python
- http://www.crummy.com/software/BeautifulSoup/
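If installing a third-party library is not an option, the standard library's html.parser module is another lenient choice. It is not one of the tools linked above, so treat the following as a rough sketch of the same idea rather than a recommendation from this answer. The LinkCollector class and the sample markup are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag encountered."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

# Hypothetical sloppy HTML: unclosed <p>, <br> and <a> tags.
sloppy = '<p>unclosed paragraph<br><a href="/one">one<a href="/two">two'

collector = LinkCollector()
collector.feed(sloppy)
collector.close()
print(collector.links)
```

Unlike an XML parser, this event-based parser does not build a tree and does not complain about unclosed tags; it simply reports start tags as it sees them.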
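Answer 2:
The file argument expects a file name or a file-like object rather than the decoded string, so pass the response object itself:
xml = ElementTree(file=site)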
P.S.: The above works only when the markup is well-formed XML. If the XML structure contains errors, or the input is bad HTML, it raises ParseError.
You might like to use BeautifulSoup for HTML parsing. If you want to use XPath and lxml, you might also like html5lib.
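To illustrate both points offline, here is a small sketch that feeds the file argument an in-memory file-like object (io.StringIO) instead of a network response. The two sample strings and the try_parse helper are hypothetical and exist only for this example.

```python
import io
from xml.etree import ElementTree as ET

# Hypothetical inputs: one well-formed, one with an unclosed <br> tag.
good = "<ul><li><a href='/one'>one</a></li></ul>"
bad = "<p>first line<br>second line</p>"

# The file argument accepts a file name or an object with a read() method.
tree = ET.ElementTree(file=io.StringIO(good))
hrefs = [a.get("href") for a in tree.findall(".//a")]

def try_parse(text):
    """Return an ElementTree, or None if the text is not well-formed XML."""
    try:
        return ET.ElementTree(file=io.StringIO(text))
    except ET.ParseError as err:
        print("not well-formed:", err)
        return None

broken = try_parse(bad)
print(hrefs, broken)
```

Catching ET.ParseError specifically, rather than using a bare except, keeps unrelated errors visible.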
It is as easy as:
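# html is assumed here to be a response object exposing .content (for example from requests);
# with the question's code, pass the decoded string or the raw bytes instead
tree = html5lib.parse(html.content, treebuilder='lxml', namespaceHTMLElements=False)
# the tree is an lxml object (parsed from any HTML, even bad HTML) that supports findall and find with XPath expressions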
Source: https://stackoverflow.com/questions/9672448/urllib-combined-together-with-elementtree