How can I retrieve the page title of a webpage (title html tag) using Python?
Using lxml...
Getting it from the page's meta tags, marked up according to Facebook's Open Graph protocol:
import lxml.html
html_doc = lxml.html.parse(some_url)
t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]
or reading the title element itself, again with .xpath:
t = html_doc.xpath(".//title")[0].text
soup.title.string
actually returns a unicode string (in Python 2).
To convert it to a plain byte string, encode it:
string = string.encode('ascii', 'ignore')
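One thing worth knowing about the 'ignore' error handler is that it silently drops every character that does not fit into ASCII. A minimal Python 3 sketch of that behaviour follows; the helper name to_ascii is made up for illustration, and in Python 3 the result is a bytes object rather than a "normal string":

```python
def to_ascii(text):
    """Encode text as ASCII bytes, silently dropping characters that do not fit."""
    return text.encode("ascii", "ignore")


# The accented 'é' has no ASCII equivalent, so it simply disappears.
print(to_ascii("Café Menu"))
```

If losing characters is not acceptable, keeping the title as text (or encoding to UTF-8 instead) is probably the safer choice.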
Here's a simplified version of @Vinko Vrsalovic's answer:
import urllib2
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print soup.title.string
NOTE:
soup.title finds the first title element anywhere in the HTML document
title.string assumes it has only one child node, and that child node is a string
For BeautifulSoup 4.x, use a different import:
from bs4 import BeautifulSoup
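The two caveats in the note (the first title element wins, and a single text child is assumed) can be illustrated without any third-party package. The sketch below swaps BeautifulSoup for the standard library's html.parser, so it is not the answer's own method; it targets Python 3, and the names TitleParser and first_title are made up for this example:

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element and ignores later ones."""

    def __init__(self):
        super().__init__()
        self.found = False      # True once the first </title> has been seen
        self._inside = False    # True while we are between <title> and </title>
        self._chunks = []       # text pieces seen inside the first title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.found:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "title" and self._inside:
            self._inside = False
            self.found = True

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)


def first_title(html_text):
    """Return the first title with whitespace collapsed, or None if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    if not parser.found:
        return None  # no complete <title>...</title> pair was seen
    return " ".join("".join(parser._chunks).split())


page = "<html><head><title> Example\n Page </title></head><body><title>Second</title></body></html>"
print(first_title(page))
```

Joining the collected pieces means the function does not depend on the title arriving as exactly one text node, and returning None for a missing title lets the caller decide on a fallback.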
The mechanize Browser object has a title() method. So the code from this post can be rewritten as:
from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print br.title()
Using regular expressions
import re
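# Ignore case, allow attributes on the tag, and let the title span several lines.
match = re.search(r'<title[^>]*>(.*?)</title>', raw_html, re.IGNORECASE | re.DOTALL)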
title = match.group(1) if match else 'No title'
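Wrapped up as a reusable function, the regex approach might look like the sketch below. It is written for Python 3 and assumes raw_html is already a decoded string; the names TITLE_RE and title_from_html, and the whitespace and entity clean-up, are additions for illustration rather than part of the answer above:

```python
import html
import re

# Case-insensitive, tolerates attributes on the tag, and lets the title span lines.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def title_from_html(raw_html, default="No title"):
    """Return the first <title> text found by regex, or default if none matches."""
    match = TITLE_RE.search(raw_html)
    if match is None:
        return default
    # Decode entities such as &amp; and collapse runs of whitespace.
    text = html.unescape(match.group(1))
    return " ".join(text.split())


print(title_from_html('<HTML><TITLE lang="en">\n Tom &amp; Jerry \n</TITLE></HTML>'))
```

A regex has no notion of document structure, so a title-like string inside a comment or script would also match; for anything beyond quick scripts, one of the parser-based answers is likely more reliable.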