How can I retrieve the page title of a webpage using Python?

How can I retrieve the page title of a webpage (title html tag) using Python?

11 Answers

广开言路

2020-12-07 09:40
Using lxml...

Getting it from the page's meta tag, as defined by Facebook's Open Graph protocol:
```
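import lxml.html

# some_url is a placeholder; libxml2 cannot fetch https:// URLs itself, so download those first
html_doc = lxml.html.parse(some_url)

# Raises IndexError if the page has no og:title meta tag
t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]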
```
or from the title tag itself, again using .xpath with lxml:
```
t = html_doc.xpath(".//title")[0].text
```
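If lxml is not installed, the same two lookups can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the lxml API; the names TitleParser and extract_title are hypothetical, and preferring og:title over the title tag is an assumption rather than something the answer prescribes.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the first <title> text and the first og:title meta content."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.og_title = None
        self._in_title = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "meta" and attrs.get("property") == "og:title":
            if self.og_title is None:
                self.og_title = attrs.get("content")

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.title = "".join(self._chunks).strip()


def extract_title(html_text):
    """Return og:title if present, else the <title> text, else None."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.og_title or parser.title
```

extract_title takes the page source as a string, so the download step is up to you.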
清歌不尽

2020-12-07 09:43

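soup.title.string actually returns a Unicode string (a NavigableString). In Python 2, to convert it into a plain byte string, you need to do string = string.encode('ascii', 'ignore'). Note that this silently drops every non-ASCII character, and that in Python 3 the result is a bytes object rather than a str.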
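To see what the 'ignore' error handler actually does to a title, here is a small Python 3 sketch. The helper names to_ascii and to_ascii_folded are hypothetical, and the NFKD folding step is an optional extra that the answer does not mention.

```python
import unicodedata


def to_ascii(text):
    """Drop every non-ASCII character, as encode('ascii', 'ignore') does."""
    return text.encode("ascii", "ignore").decode("ascii")


def to_ascii_folded(text):
    """Decompose accented letters first so that their base letter survives."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Sample title written with escapes: e-acute, en dash, u-umlaut.
title = "Caf\u00e9 \u2013 Men\u00fc"
print(to_ascii(title))         # accented letters and the dash vanish
print(to_ascii_folded(title))  # accents are stripped, the dash still vanishes
```

On Python 3 the title is usually best left as a str; converting to ASCII is only worth it when some downstream consumer really cannot handle Unicode.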

[愿得一人]

2020-12-07 09:44
Here's a simplified version of @Vinko Vrsalovic's answer:
```
# Python 2 only: urllib2 and the BeautifulSoup 3.x package are not available on Python 3
import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen("https://www.google.com"))
print(soup.title.string)
```
NOTE:
- soup.title finds the first title element anywhere in the HTML document
- title.string assumes the title has exactly one child node, and that child node is a string
For BeautifulSoup 4.x, use a different import:
```
from bs4 import BeautifulSoup
```
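urllib2 exists only in Python 2; on Python 3 the download step goes through urllib.request. A minimal sketch of that step, assuming a hypothetical helper named fetch_html and a UTF-8 fallback when the server sends no charset (both are assumptions, not part of the answer):

```python
import urllib.request


def fetch_html(url, fallback_encoding="utf-8"):
    """Download a page and decode it to str (Python 3 stand-in for urllib2.urlopen)."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Use the charset from the Content-Type header when one is sent.
        charset = response.headers.get_content_charset() or fallback_encoding
    # Undecodable bytes become U+FFFD instead of raising an error.
    return raw.decode(charset, errors="replace")
```

The returned text can then be handed to the parser, for example BeautifulSoup(fetch_html(url), "html.parser").title.string with the bs4 import above.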
星月不相逢

2020-12-07 09:46
The mechanize Browser object has a title() method. So the code from this post can be rewritten as:
```
from mechanize import Browser
br = Browser()
br.open("http://www.google.com/")
print(br.title())
```

2020-12-07 09:48

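Using regular expressions:
```
import re

# raw_html is assumed to already hold the page source as a string
match = re.search(r'<title[^>]*>(.*?)</title>', raw_html, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else 'No title'
```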
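A regex captures the raw markup between the tags, so entities such as &amp;amp; stay encoded and line breaks inside the title are kept. A hedged sketch that wraps the search in a hypothetical helper, title_from_html, and cleans up the captured text (the helper name and the cleanup steps are assumptions, not part of the answer):

```python
import html
import re

# Tolerates attributes on the tag, mixed case, and titles spanning several lines.
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_from_html(raw_html, default="No title"):
    """Return the cleaned text of the first <title> element, or default."""
    match = TITLE_RE.search(raw_html)
    if match is None:
        return default
    # Decode entities such as &amp; and collapse runs of whitespace.
    return " ".join(html.unescape(match.group(1)).split())
```

Regular expressions are fine for a quick lookup like this, but they do not understand HTML structure (comments, scripts containing the text "<title>"), so a real parser is the safer choice for arbitrary pages.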
