I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I've seen previous questions like retrieving jus
Question: ... the only place I can optimize is likely to not read in the entire page.
This does not read the entire page.
Note: Unicode .decode() will raise an exception if you cut a Unicode sequence in the middle. Using .decode(errors='ignore') removes those sequences.
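To make the note concrete, here is a small Python 3 sketch. The sample string is my own, chosen only because its last character takes two bytes in UTF-8; it is not taken from any of the pages below.

```python
# "\u00e9" (e with acute accent) is two bytes in UTF-8: 0xC3 0xA9.
raw = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'
chunk = raw[:-1]                    # cut mid-sequence: b'caf\xc3'

try:
    chunk.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    # Strict decoding refuses the incomplete trailing sequence.
    strict_result = "UnicodeDecodeError"

# errors='ignore' silently drops the incomplete bytes instead.
lenient_result = chunk.decode("utf-8", errors="ignore")

print(strict_result)
print(lenient_result)
```

The price of errors='ignore' is that a character split across a chunk boundary is lost, which only matters if that boundary happens to fall inside the title.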
For instance:
import re

try:
    # Python 3
    from urllib import request
except ImportError:
    # Python 2
    import urllib2 as request

# Capture the text between <title> and </title>; DOTALL lets it span lines.
re_obj = re.compile(r'<title[^>]*>(.*?)</title>', re.DOTALL | re.IGNORECASE)

for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
    f = request.urlopen(url)
    Found = False
    data = ''
    while True:
        # Read 4 KB at a time and stop as soon as the title has been seen.
        b_data = f.read(4096)
        if not b_data:
            break
        data += b_data.decode(errors='ignore')
        match = re_obj.search(data)
        if match:
            Found = True
            title = match.group(1)
            print('title={}'.format(title))
            break
    f.close()
Output:
title=Welcome to Python.org
title=Google
title=Bitly | URL Shortener and Link Management Platform
Tested with Python: 3.4.2 and 2.7.9
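If you want to reuse the chunked approach across thousands of URLs, one possible refactoring is sketched below (Python 3 only). The names find_title, TITLE_RE and max_bytes are hypothetical, the 64 KB cap is an arbitrary assumption so that pages without a title are not downloaded in full, and the incremental decoder from the standard codecs module is a swapped-in alternative to per-chunk .decode(errors='ignore') that keeps characters split across chunk boundaries. Treat the regex as a heuristic, not an HTML parser.

```python
import codecs
import re

# Heuristic pattern for the first <title> element (hypothetical name).
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def find_title(chunks, max_bytes=65536, encoding="utf-8"):
    """Return the first title found in an iterable of byte chunks, or None."""
    # The incremental decoder buffers an incomplete multi-byte sequence
    # until the next chunk arrives instead of dropping it.
    decoder = codecs.getincrementaldecoder(encoding)(errors="ignore")
    text = ""
    seen = 0
    for chunk in chunks:
        seen += len(chunk)
        text += decoder.decode(chunk)
        match = TITLE_RE.search(text)
        if match:
            return match.group(1).strip()
        if seen >= max_bytes:
            break  # give up: no title within the byte budget
    return None


# Simulated response body, split in the middle of a two-byte character.
body = ("<html><head><title>Caf\u00e9 menu</title></head>"
        "<body>...</body></html>").encode("utf-8")
cut = body.index(b"\xc3") + 1
print(find_title([body[:cut], body[cut:]]))
```

With a real response object f from urlopen, the chunks could be supplied as iter(lambda: f.read(4096), b""), which keeps calling f.read until it returns an empty bytes object.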