How to get page title in requests

伪装坚强ぢ 2020-12-28 22:53

What would be the simplest way to get the title of a page in Requests?

r = requests.get('http://www.imdb.com/title/tt0108778/')
# ? r.title
Friends (TV Ser         


        
5 Answers
  • 2020-12-28 22:56

    A regex with lookbehind and lookahead:

    re.search('(?<=<title>).+?(?=</title>)', mytext, re.DOTALL).group().strip()
    

    re.DOTALL is needed because the title text can contain a newline character (\n), which . does not match by default.
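
A minimal sketch of the same regex idea wrapped in a function and run on an inline string instead of a live response. The names title_via_regex and TITLE_RE are hypothetical. A capture group is swapped in for the lookarounds so that a title tag carrying attributes also matches, and the None check covers pages with no title at all, where the one-liner above would raise AttributeError.

```python
import re

# Hypothetical helper names; nothing here is part of requests itself.
# [^>]* tolerates attributes on the opening tag, DOTALL lets the
# title text span several lines, IGNORECASE accepts <TITLE>.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def title_via_regex(html_text):
    """Return the stripped <title> text, or None if no title is found."""
    match = TITLE_RE.search(html_text)
    if match is None:
        return None
    return match.group(1).strip()


sample = "<html><head><title>\n  Friends - IMDb\n</title></head></html>"
print(title_via_regex(sample))
```

With a real response, the call would presumably be title_via_regex(r.text).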

  • 2020-12-28 23:07

    You need an HTML parser to parse the HTML response and get the title tag's text:

    Example using lxml.html:

    >>> import requests
    >>> from lxml.html import fromstring
    >>> r = requests.get('http://www.imdb.com/title/tt0108778/')
    >>> tree = fromstring(r.content)
    >>> tree.findtext('.//title')
    u'Friends (TV Series 1994\u20132004) - IMDb'
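
If installing lxml is not an option, the same parse-then-look-up idea can be sketched with the standard library's html.parser module. This is only an illustration under that assumption: TitleParser and title_via_parser are hypothetical names, and the sample markup stands in for r.text.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect all of them.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip() if self._done else None


def title_via_parser(html_text):
    """Return the first <title> text, or None if no complete title exists."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


sample = ("<html><head><title>Friends (TV Series 1994\u20132004) - IMDb</title>"
          "</head><body><p>hi</p></body></html>")
print(title_via_parser(sample))
```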
    

    There are other options as well, for example the mechanize library:

    >>> import mechanize
    >>> br = mechanize.Browser()
    >>> br.open('http://www.imdb.com/title/tt0108778/')
    >>> br.title()
    'Friends (TV Series 1994\xe2\x80\x932004) - IMDb'
    

    Which option to choose depends on what you are going to do next: parse the page to get more data, or interact with it (click buttons, submit forms, follow links, etc.).

    You may also want to use an API provided by IMDb instead of parsing HTML yourself; see:

    • Does IMDB provide an API?
    • IMDbPY

    Example usage of the IMDbPY package:

    >>> from imdb import IMDb
    >>> ia = IMDb()
    >>> movie = ia.get_movie('0108778')
    >>> movie['title']
    u'Friends'
    >>> movie['series years']
    u'1994-2004'
    
  • 2020-12-28 23:07

    Using requests-html (tagline: "Pythonic HTML Parsing for Humans"):

    from requests_html import HTMLSession
    
    print(HTMLSession().get('http://www.imdb.com/title/tt0108778/').html.find('title', first=True).text)
    
  • 2020-12-28 23:10

    There is no need to import other libraries. Requests has no title helper of its own, but the title can be sliced out of the response text with plain string methods:

    >>> import requests
    >>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
    >>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)
    >>> al = n.text
    >>> al[al.find('<title>') + 7 : al.find('</title>')]
    u'Friends (TV Series 1994\u20132004) - IMDb'
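
One caveat with the slice above: str.find returns -1 when the substring is absent, so a page without a title tag yields an unrelated chunk of text instead of an error. A small sketch that guards against this follows; title_via_find is a hypothetical name, and the search stays case-sensitive, as in the original snippet.

```python
def title_via_find(html_text):
    """Slice out the text between <title> and </title> using str.find.

    Returns None when either tag is missing. The match is case-sensitive
    and does not handle attributes on the opening tag.
    """
    open_tag, close_tag = "<title>", "</title>"
    start = html_text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    # Search for the closing tag only after the opening one.
    end = html_text.find(close_tag, start)
    if end == -1:
        return None
    return html_text[start:end].strip()


print(title_via_find("<head><title> Friends </title></head>"))
```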
    

    Update after ZN13's comment

    >>> import re
    >>> import requests
    >>> n = requests.get('https://www.libsdl.org/release/SDL-1.2.15/docs/html/guideinputkeyboard.html')
    >>> al = n.text
    >>> d = re.search(r'<\W*title\W*(.*?)</title', al, re.IGNORECASE)
    >>> d.group(1)
    u'Handling the Keyboard'
    

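    This works whether or not extra non-word characters (such as whitespace or a line break) surround the tag name. Because . does not match newlines by default, a title whose text spans several lines additionally needs re.DOTALL.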
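
Both the slicing and the regex approach return the raw markup text, so HTML entities such as &amp; stay encoded and line breaks inside the title are kept. A short sketch of a clean-up step using only the standard library follows; clean_title is a hypothetical helper name.

```python
import html


def clean_title(raw_title):
    """Decode HTML entities and collapse runs of whitespace to one space."""
    decoded = html.unescape(raw_title)
    # split() with no arguments splits on any whitespace, including newlines.
    return " ".join(decoded.split())


print(clean_title("Tom &amp; Jerry\n   (1940&ndash;1958)"))
```

A real HTML parser performs the entity decoding itself, which is one reason the parser-based answers need no such step.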

  • 2020-12-28 23:12

    You could use Beautiful Soup to parse the HTML.

    Install it using pip install beautifulsoup4

    >>> import requests
    >>> r = requests.get('http://www.imdb.com/title/tt0108778/')
    >>> import bs4
    >>> html = bs4.BeautifulSoup(r.text, 'html.parser')
    >>> html.title
    <title>Friends (TV Series 1994–2004) - IMDb</title>
    >>> html.title.text
    u'Friends (TV Series 1994\u20132004) - IMDb'
    