What would be the simplest way to get the title of a page in Requests?
r = requests.get('http://www.imdb.com/title/tt0108778/')
# ? r.title
Friends (TV Ser
Regex with lookbehind and lookahead:
re.search('(?<=<title>).+?(?=</title>)', mytext, re.DOTALL).group().strip()
re.DOTALL is needed because the title can contain a newline character (\n).
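To see what re.DOTALL changes, here is a minimal sketch on an inline HTML string. The sample_html value is made up for illustration and is not fetched from IMDb.

```python
import re

# Made-up page snippet whose <title> spans several lines.
sample_html = "<html><head><title>\n  Friends\n  (TV Series)\n</title></head></html>"

pattern = r"(?<=<title>).+?(?=</title>)"

# Without re.DOTALL, '.' does not match '\n', so the search fails here.
no_flag = re.search(pattern, sample_html)

# With re.DOTALL, '.' also matches '\n' and the whole title is captured.
with_flag = re.search(pattern, sample_html, re.DOTALL)
title = with_flag.group().strip()

# Optionally collapse the internal line breaks and indentation.
one_line_title = " ".join(title.split())
```

Note that strip() only trims the ends, so a multi-line title keeps its internal line breaks unless you collapse them as in the last line.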
You need an HTML parser to parse the HTML response and get the title tag's text.
Example using lxml.html:
>>> import requests
>>> from lxml.html import fromstring
>>> r = requests.get('http://www.imdb.com/title/tt0108778/')
>>> tree = fromstring(r.content)
>>> tree.findtext('.//title')
u'Friends (TV Series 1994\u20132004) - IMDb'
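If installing lxml is not an option, the same idea can be sketched with the standard library's html.parser module. This is a swapped-in technique rather than what the answer above uses, and TitleParser and get_title are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is handled too.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Only keep text that appears while inside the title element.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip() or None


def get_title(html_text):
    """Return the page title, or None if the document has no title text."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title
```

With a Requests response r, this would be called as get_title(r.text).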
There are certainly other options, such as the mechanize library:
>>> import mechanize
>>> br = mechanize.Browser()
>>> br.open('http://www.imdb.com/title/tt0108778/')
>>> br.title()
'Friends (TV Series 1994\xe2\x80\x932004) - IMDb'
Which option to choose depends on what you are going to do next: parse the page to get more data, or perhaps interact with it (click buttons, submit forms, follow links, etc.).
Besides, you may want to use an API provided by IMDb instead of going down to HTML parsing.
Example usage of the IMDbPY package:
>>> from imdb import IMDb
>>> ia = IMDb()
>>> movie = ia.get_movie('0108778')
>>> movie['title']
u'Friends'
>>> movie['series years']
u'1994-2004'
Using requests-html, "Pythonic HTML Parsing for Humans":
from requests_html import HTMLSession
print(HTMLSession().get('http://www.imdb.com/title/tt0108778/').html.find('title', first=True).text)
No need to import other libraries: the response text is an ordinary string, so you can slice the title out of it directly.
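>>> import requests
>>> # The header name must be 'User-Agent' for the server to see a browser-like client
>>> headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0'}
>>> n = requests.get('http://www.imdb.com/title/tt0108778/', headers=headers)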
>>> al = n.text
>>> al[al.find('<title>') + 7 : al.find('</title>')]
u'Friends (TV Series 1994\u20132004) - IMDb'
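One caveat with the slice above: str.find returns -1 when the substring is missing, so on a page without a plain <title> tag the expression quietly returns the wrong text instead of failing. A minimal sketch of a guarded version, where title_by_find is a hypothetical helper name:

```python
def title_by_find(html_text):
    """Return the text between <title> and </title>, or None if either is missing."""
    open_tag = "<title>"
    start = html_text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)
    # Search for the closing tag only after the opening one.
    end = html_text.find("</title>", start)
    if end == -1:
        return None
    return html_text[start:end].strip()
```

Like the one-liner, it only recognises a lowercase <title> tag with no attributes.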
Update after ZN13's comment
>>> import re
>>> import requests
>>> n = requests.get('https://www.libsdl.org/release/SDL-1.2.15/docs/html/guideinputkeyboard.html')
>>> al = n.text
>>> d = re.search(r'<\W*title\W*(.*)</title', al, re.IGNORECASE)
>>> d.group(1)
u'Handling the Keyboard'
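This works whether or not extra non-alphanumeric characters (such as whitespace) surround the title tag name, and it ignores case. It does not match a title that spans several lines, because . does not match a newline unless re.DOTALL is also passed.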
You could use Beautiful Soup to parse the HTML.
Install it using pip install beautifulsoup4
>>> import requests
>>> r = requests.get('http://www.imdb.com/title/tt0108778/')
>>> import bs4
>>> html = bs4.BeautifulSoup(r.text, 'html.parser')
>>> html.title
<title>Friends (TV Series 1994–2004) - IMDb</title>
>>> html.title.text
u'Friends (TV Series 1994\u20132004) - IMDb'