Scrape title by only downloading relevant part of webpage

深忆病人 2021-02-05 10:45

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites so it has to be fast. I've seen previous questions like retrieving jus

6 answers

深忆病人 (original poster)

2021-02-05 11:36

Question: ... the only place I can optimize is likely to not read in the entire page.

This does not read the entire page.

Note: .decode() raises a UnicodeDecodeError if a chunk boundary cuts a multi-byte sequence in the middle. Using .decode(errors='ignore') drops those partial sequences instead.

For instance:

import re
try:
    # Python 3
    from urllib import request
except ImportError:
    # Python 2
    import urllib2 as request

# Group 1 is the whole <head> element, group 2 is the text of <title>.
re_obj = re.compile(r'.*(<head.*<title.*?>(.*)</title>.*</head>)', re.DOTALL)

for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
    f = request.urlopen(url)
    Found = False
    data = ''
    while True:
        # Read the response in 4 KiB chunks and stop at the first match.
        b_data = f.read(4096)
        if not b_data:
            break

        data += b_data.decode(errors='ignore')
        match = re_obj.match(data)
        if match:
            Found = True
            title = match.groups()[1]
            print('title={}'.format(title))
            break

    f.close()

Output:
title=Welcome to Python.org
title=Google
title=Bitly | URL Shortener and Link Management Platform

Tested with Python: 3.4.2 and 2.7.9
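
The note about cut Unicode sequences can be illustrated offline. The following is a minimal sketch, not part of the tested answer above: find_title is a hypothetical helper, the regex is a simplified variant that stops at the closing title tag rather than the closing head tag, and UTF-8 is assumed as the page encoding. It uses an incremental decoder from the standard codecs module, which buffers a partial multi-byte sequence until the next chunk arrives instead of dropping it. Python 3 only.

```python
import codecs
import re

# Simplified pattern (an assumption, not the answer's regex): stop at </title>.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.DOTALL | re.IGNORECASE)


def find_title(chunks, encoding='utf-8', max_chars=65536):
    """Return the first <title> text found in an iterable of byte chunks, or None."""
    # The incremental decoder keeps an unfinished multi-byte sequence in its
    # internal buffer, so a character split across two chunks is not lost.
    decoder = codecs.getincrementaldecoder(encoding)(errors='ignore')
    text = ''
    for chunk in chunks:
        text += decoder.decode(chunk)
        match = TITLE_RE.search(text)
        if match:
            return match.group(1).strip()
        if len(text) >= max_chars:
            # Give up rather than buffer an arbitrarily large page.
            break
    return None


# Build a small page and split it in the middle of the two-byte "é".
page = '<html><head><title>Caf\u00e9 menu</title></head><body></body></html>'.encode('utf-8')
cut = page.index(b'\xc3\xa9') + 1
chunks = [page[:cut], page[cut:]]

# Decoding each chunk on its own with errors='ignore' loses the split character.
naive_text = ''.join(c.decode('utf-8', errors='ignore') for c in chunks)
naive = TITLE_RE.search(naive_text).group(1)

print(naive)               # Caf menu
print(find_title(chunks))  # Café menu
```

With a real response object f from request.urlopen(url), the chunks could be supplied as iter(lambda: f.read(4096), b''), which yields 4096-byte reads until the stream is exhausted.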
