How can I retrieve the page title of a webpage using Python?

南笙 2020-12-07 08:55

How can I retrieve the page title of a webpage (the HTML title tag) using Python?

11 Answers
  • 2020-12-07 09:28

    Using HTMLParser:

    from urllib.request import urlopen
    from html.parser import HTMLParser
    
    
    class TitleParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.match = False
            self.title = ''
    
        def handle_starttag(self, tag, attributes):
            self.match = tag == 'title'
    
        def handle_data(self, data):
            if self.match:
                self.title = data
                self.match = False
    
    url = "http://example.com/"
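    # Decode the raw bytes; str(bytes) would yield the "b'...'" repr instead of the page text.
    html_string = urlopen(url).read().decode('utf-8', errors='replace')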
    
    parser = TitleParser()
    parser.feed(html_string)
    print(parser.title)  # prints: Example Domain
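
The parser above keeps only the last text chunk seen after an opening tag. As a rough sketch (not the answerer's code), a variant that collects every chunk between the opening and closing title tags can be tried offline on a plain string. The names TitleCollector and title_from_html are made up for this illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_data(self, data):
        if self.in_title:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True


def title_from_html(html_string):
    """Return the stripped title text, or None if no complete <title> was seen."""
    parser = TitleCollector()
    parser.feed(html_string)
    parser.close()
    if not parser.done:
        return None
    return ''.join(parser.chunks).strip()


sample = "<html><head><title> Tom &amp; Jerry </title></head><body><p>hi</p></body></html>"
print(title_from_html(sample))
```

Returning None for a page without a title lets the caller decide what to do, instead of printing an empty string.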
    
  • 2020-12-07 09:32

    No need to import other parsing libraries: fetch the page with requests and slice the title out of the response text.

    >>> import requests
    >>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
    >>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
    >>> al = n.text
    >>> al[al.find('<title>') + 7 : al.find('</title>')]
    u'Friends (TV Series 1994\u20132004) - IMDb' 
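
The slice above assumes the page contains a literal, lower-case title tag with no attributes. When the tag is missing, str.find returns -1 and the slice silently yields the wrong text. A minimal sketch of a more tolerant variant using the standard re module follows. The helper name slice_title is hypothetical, and a regex remains a rough tool for HTML.

```python
import html
import re

# Matches <title ...>text</title>, ignoring case and spanning newlines.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)


def slice_title(page_text):
    """Return the title text, or None when no <title> element is found."""
    match = TITLE_RE.search(page_text)
    if match is None:
        return None
    # Decode entities such as &amp; and collapse runs of whitespace.
    return ' '.join(html.unescape(match.group(1)).split())


print(slice_title('<head><TITLE lang="en">\n  Friends &amp; Co.\n</TITLE></head>'))
```

With requests, this would be called as slice_title(n.text).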
    
  • 2020-12-07 09:34

    This is probably overkill for such a simple task, but if you plan to do more than that, it's saner to start with these tools (mechanize, BeautifulSoup), because they are much easier to use than the alternatives (urllib to fetch the content, and regexes or some other parser to parse the HTML).

    Links: BeautifulSoup mechanize

    #!/usr/bin/env python
    #coding:utf-8
    
    from bs4 import BeautifulSoup
    from mechanize import Browser
    
    # This retrieves the webpage content
    br = Browser()
    res = br.open("https://www.google.com/")
    data = res.get_data()
    
    # This parses the content
    soup = BeautifulSoup(data, "html.parser")
    title = soup.find('title')
    
    # This outputs the content :)
    print(title.get_text())
    
  • 2020-12-07 09:37

    Use soup.select_one to target the title tag:

    import requests
    from bs4 import BeautifulSoup as bs
    
    r = requests.get('http://example.com/')  # replace with the target URL
    soup = bs(r.content, 'lxml')
    print(soup.select_one('title').text)
    
  • 2020-12-07 09:37

    Here is a fault-tolerant HTMLParser implementation.
    You can throw pretty much anything at get_title() without it breaking. If anything unexpected happens, get_title() will return None.
    When Parser() downloads the page, it decodes it as ASCII regardless of the charset used in the page, ignoring any errors. It would be trivial to change to_ascii() to convert the data using UTF-8 or any other encoding: just add an encoding argument and rename the function to something like to_encoding().
    On older Python versions HTMLParser() could raise on broken HTML, even on trivial things like mismatched tags. To prevent this behavior I replaced HTMLParser()'s error method with a function that ignores the errors.

    #-*-coding:utf8;-*-
    #qpy:3
    #qpy:console
    
    ''' 
    Extract the title from a web page using
    the standard lib.
    '''
    
    from html.parser import HTMLParser
    from urllib.request import urlopen
    import urllib.error
    
    def error_callback(*_, **__):
        pass
    
    def is_string(data):
        return isinstance(data, str)
    
    def is_bytes(data):
        return isinstance(data, bytes)
    
    def to_ascii(data):
        # Always return a str containing only ASCII characters,
        # since HTMLParser.feed() expects str, not bytes.
        if is_bytes(data):
            return data.decode('ascii', errors='ignore')
        if not is_string(data):
            data = str(data)
        return data.encode('ascii', errors='ignore').decode('ascii')
    
    
    class Parser(HTMLParser):
        def __init__(self, url):
            self.title = None
            self.rec = False
            HTMLParser.__init__(self)
            # Install the no-op error handler before parsing starts.
            self.error = error_callback
            try:
                self.feed(to_ascii(urlopen(url).read()))
            except (urllib.error.URLError, ValueError):
                # HTTPError is a subclass of URLError, so it is covered here too.
                return
    
            self.rec = False
    
        def handle_starttag(self, tag, attrs):
            if tag == 'title':
                self.rec = True
    
        def handle_data(self, data):
            if self.rec:
                self.title = data
    
        def handle_endtag(self, tag):
            if tag == 'title':
                self.rec = False
    
    
    def get_title(url):
        return Parser(url).title
    
    print(get_title('http://www.google.com'))
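
The description above mentions generalizing to_ascii() into something like to_encoding(). A minimal sketch of how that could look, assuming the goal is a str limited to characters the chosen encoding can represent. The signature and defaults are an illustration, not the original author's code.

```python
def to_encoding(data, encoding='utf-8', errors='ignore'):
    """Return data as a str, dropping anything the encoding cannot represent."""
    if isinstance(data, bytes):
        # Raw bytes, e.g. from urlopen(...).read(): decode them directly.
        return data.decode(encoding, errors=errors)
    if not isinstance(data, str):
        data = str(data)
    # Round-trip a str so characters outside the encoding are filtered out.
    return data.encode(encoding, errors=errors).decode(encoding)


print(to_encoding('caf\u00e9'.encode('utf-8')))    # bytes in, str out
print(to_encoding('caf\u00e9', encoding='ascii'))  # the non-ASCII character is dropped
```

Parser.__init__ would then call self.feed(to_encoding(urlopen(url).read())) in place of to_ascii().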
    
  • 2020-12-07 09:39

    I always use lxml for such tasks. You could use BeautifulSoup as well.

    import lxml.html
    t = lxml.html.parse(url)  # url: address (or file path) of the page to parse
    print(t.find(".//title").text)
    

    EDIT based on comment:

    from urllib.request import urlopen
    from lxml.html import parse
    
    url = "https://www.google.com"
    page = urlopen(url)
    p = parse(page)
    print(p.find(".//title").text)
    