I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I\'ve seen previous questions like retrieving jus
You can defer downloading the entire response body by enabling stream mode of requests
.
Requests 2.14.2 documentation - Advanced Usage
By default, when you make a request, the body of the response is downloaded immediately. You can override this behaviour and defer downloading the response body until you access the
Response.content
attribute with thestream
parameter:...
If you set
stream
toTrue
when making a request, Requests cannot release the connection back to the pool unless you consume all the data or callResponse.close
. This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while usingstream=True
, you should consider using contextlib.closing (documented here)
So, with this method, you can read the response chunk by chunk until you encounter the title tag. Since the redirects will be handled by the library you'll be ready to go.
Here's an error-prone code tested with Python 2.7.10 and 3.6.0:
try:
from HTMLParser import HTMLParser
except ImportError:
from html.parser import HTMLParser
import requests, re
from contextlib import closing
CHUNKSIZE = 1024
retitle = re.compile("]*>(.*?) ", re.IGNORECASE | re.DOTALL)
buffer = ""
htmlp = HTMLParser()
with closing(requests.get("http://example.com/abc", stream=True)) as res:
for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
buffer = "".join([buffer, chunk])
match = retitle.search(buffer)
if match:
print(htmlp.unescape(match.group(1)))
break