I'm looking to write a Python script (using 3.4.3) that grabs an HTML page from a URL and can go through the DOM to try to find a specific element.
I currently have
There are several modules you could use for this, for example lxml or BeautifulSoup.
Here's an lxml example:
import urllib.request
import lxml.html
mysite = urllib.request.urlopen('http://www.google.com').read()
lxml_mysite = lxml.html.fromstring(mysite)
description = lxml_mysite.xpath("//meta[@name='description']")[0] # meta tag description
text = description.get('content') # content attribute of the tag
>>> print(text)
"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
And a BeautifulSoup example:
import urllib.request
from bs4 import BeautifulSoup
mysite = urllib.request.urlopen('http://www.google.com').read()
soup_mysite = BeautifulSoup(mysite, 'html.parser') # name the parser explicitly
description = soup_mysite.find("meta", {"name": "description"}) # meta tag description
text = description['content'] # text of content attribute
>>> print(text)
u"Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for."
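Notice how the BeautifulSoup output carries a u prefix (a unicode string), while the lxml output does not. That difference shows up under Python 2, where lxml returns a plain byte string for ASCII-only text; under Python 3 both libraries return str. Depending on what you need, this can be helpful or harmful.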
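Both examples above assume the meta tag is present. If it is missing, the lxml version raises IndexError on the [0], and the BeautifulSoup version fails because find returns None. Here is a minimal sketch of the same lookup with a guard for the missing case. It swaps in the standard library's html.parser so it runs without lxml or bs4 installed; the names MetaDescriptionParser and find_meta_description and the sample HTML strings are made up for illustration.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Hypothetical helper: remembers the first <meta name="description"> content."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names arrive lowercased.
        if tag == "meta" and self.description is None:
            attributes = dict(attrs)
            if attributes.get("name") == "description":
                self.description = attributes.get("content")


def find_meta_description(html_text):
    """Return the description text, or None when the tag is absent."""
    parser = MetaDescriptionParser()
    parser.feed(html_text)
    parser.close()
    return parser.description


# Made-up sample documents, used here instead of a live request.
with_tag = '<html><head><meta name="description" content="A sample page"></head></html>'
without_tag = "<html><head><title>No description</title></head></html>"

print(find_meta_description(with_tag))
print(find_meta_description(without_tag))
```

The same idea carries over to the libraries above: check that the xpath result list is non-empty, or that find did not return None, before reading the content attribute.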
Check out the BeautifulSoup module.
from bs4 import BeautifulSoup
import urllib.request
# Python 3: urlopen lives in urllib.request, not urllib
soup = BeautifulSoup(urllib.request.urlopen("http://google.com").read(), "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))
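The href values printed by the loop above can be relative paths, and link.get('href') gives None for an a tag with no href attribute. If absolute URLs are needed, one possible follow-up is to resolve them with urllib.parse.urljoin. This is only a sketch: the function name absolute_links, the base URL, and the sample hrefs are assumptions for illustration, not part of the answer.

```python
from urllib.parse import urljoin


def absolute_links(hrefs, base_url):
    """Drop missing hrefs and resolve the rest against base_url."""
    resolved = []
    for href in hrefs:
        if not href:
            # link.get('href') yields None for <a> tags without an href attribute.
            continue
        resolved.append(urljoin(base_url, href))
    return resolved


# Made-up values standing in for what link.get('href') might return.
sample_hrefs = ["/about", None, "images/logo.png", "http://example.org/page"]
for url in absolute_links(sample_hrefs, "http://example.com/docs/"):
    print(url)
```

In the loop above this would be called as absolute_links((link.get('href') for link in soup.find_all('a')), page_url), where page_url is whatever URL was fetched.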