How can I retrieve the page title of a webpage using Python?

南笙 2020-12-07 08:55

How can I retrieve the page title of a webpage (the HTML <title> tag) using Python?

11 Answers
  • 2020-12-07 09:40

    Using lxml...

    Getting it from the page's meta tags, following Facebook's Open Graph protocol:

    import lxml.html
    html_doc = lxml.html.parse(some_url)  # some_url is defined elsewhere
    
    # Raises IndexError if the page has no og:title meta tag
    t = html_doc.xpath('//meta[@property="og:title"]/@content')[0]
    

    or reading the <title> element itself with .xpath:

    t = html_doc.xpath(".//title")[0].text
    
  • 2020-12-07 09:43

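    In Python 2, soup.title.string returns a unicode string. To convert it into a plain byte string (str), encode it, for example string = string.encode('ascii', 'ignore'). Note that 'ignore' silently drops every non-ASCII character. In Python 3 the value is already a text string, so no conversion is needed.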
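
    To make the effect of the 'ignore' error handler concrete, here is a small Python 3 sketch. The title string is made up for illustration; the point is only that encoding to ASCII loses characters, while UTF-8 round-trips them.

```python
title = "Café Menü"  # made-up title containing two non-ASCII characters

# 'ignore' silently drops anything that is not ASCII
ascii_bytes = title.encode("ascii", "ignore")

# UTF-8 can represent every character, so nothing is lost
utf8_bytes = title.encode("utf-8")

print(ascii_bytes)
print(utf8_bytes.decode("utf-8") == title)
```

    If the title has to be written out as bytes, UTF-8 is usually the safer choice unless the consumer really requires ASCII.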

  • 2020-12-07 09:44

    Here's a simplified version of @Vinko Vrsalovic's answer:

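    import urllib.request
    from bs4 import BeautifulSoup  # BeautifulSoup 4.x
    
    # Fetch the page and parse it with the standard-library HTML parser
    with urllib.request.urlopen("https://www.google.com") as response:
        soup = BeautifulSoup(response, "html.parser")
    print(soup.title.string)
    

    NOTE:

    • soup.title finds the first title element anywhere in the HTML document, and is None if there is none

    • title.string assumes the element has only one child node, and that the child node is a string; if the element has several children, it returns None

    The code above targets Python 3 and BeautifulSoup 4.x. On Python 2 with BeautifulSoup 3.x, use urllib2.urlopen and a different import:

    from BeautifulSoup import BeautifulSoup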
    
  • 2020-12-07 09:46

    The mechanize Browser object has a title() method. So the code from this post can be rewritten as:

    from mechanize import Browser
    br = Browser()
    br.open("http://www.google.com/")
    print(br.title())
    
  • 2020-12-07 09:48

    Using regular expressions

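    import re
    # raw_html holds the page source as a string.
    # [^>]* allows attributes on the tag; the flags handle <TITLE> and titles that span lines.
    match = re.search(r'<title[^>]*>(.*?)</title>', raw_html, re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else 'No title'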
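
    Regular expressions can be brittle on real-world HTML. As an alternative that needs only the standard library, here is a minimal sketch built on html.parser. TitleParser and extract_title are hypothetical names chosen for this example, and the sample markup is made up; it is one possible approach, not a replacement for a full HTML library.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased, so <TITLE> matches as well
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        text = "".join(self._chunks).strip()
        return text or None


def extract_title(raw_html):
    """Return the first page title in raw_html, or None if there is none."""
    parser = TitleParser()
    parser.feed(raw_html)
    parser.close()
    return parser.title


sample = '<html><head><TITLE lang="en">\n  Example Page\n</TITLE></head><body></body></html>'
print(extract_title(sample))
```

    Returning None for a missing title leaves it to the caller to pick a fallback such as 'No title'.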
    