Scrape title by only downloading relevant part of webpage

深忆病人 2021-02-05 10:45

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites, so it has to be fast. I've seen previous questions like retrieving jus

6 answers
  •  深忆病人
    2021-02-05 11:36

    Question: ... the only place I can optimize is likely to not read in the entire page.

    This does not read the entire page: the response is read in 4096-byte chunks, and reading stops as soon as the title is matched.

    Note: .decode() raises a UnicodeDecodeError if a chunk boundary cuts a multi-byte Unicode sequence in the middle. Using .decode(errors='ignore') drops those partial sequences instead.

    For instance:

    import re
    try:
        # Python 3
        from urllib import request
    except ImportError:
        # Python 2
        import urllib2 as request

    # Group 1 is the whole <head> element, group 2 is the title text.
    re_obj = re.compile(r'.*(<head.*<title.*?>(.*)</title>.*</head>)', re.DOTALL)

    for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
        f = request.urlopen(url)
        data = ''
        while True:
            # Read the response in 4096-byte chunks instead of all at once.
            b_data = f.read(4096)
            if not b_data:
                break

            data += b_data.decode(errors='ignore')
            match = re_obj.match(data)
            if match:
                # Stop reading as soon as the title has been found.
                title = match.groups()[1]
                print('title={}'.format(title))
                break

        f.close()
    

    Output:
    title=Welcome to Python.org
    title=Google
    title=Bitly | URL Shortener and Link Management Platform

    Tested with Python: 3.4.2 and 2.7.9
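
    As a possible variation on the note about cut Unicode sequences, the sketch below uses an incremental decoder from the standard codecs module, which holds back an incomplete multi-byte sequence until the next chunk arrives instead of discarding it. The helper name find_title, the simplified title-only pattern, and the UTF-8 default are assumptions made for illustration and are not part of the answer above. It takes any iterable of byte chunks, so it can be tried without a network connection.

```python
import codecs
import re

# Illustrative pattern (an assumption): matches only the <title> element.
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.DOTALL | re.IGNORECASE)


def find_title(chunks, encoding='utf-8'):
    """Return the first <title> text in an iterable of byte chunks, or None.

    Hypothetical helper; the encoding is assumed rather than detected.
    """
    # The incremental decoder buffers an incomplete multi-byte sequence
    # until the following chunk completes it, so nothing is lost at
    # chunk boundaries.
    decoder = codecs.getincrementaldecoder(encoding)(errors='replace')
    text = ''
    for chunk in chunks:
        text += decoder.decode(chunk)
        match = TITLE_RE.search(text)
        if match:
            # Stop consuming chunks as soon as the title is complete.
            return match.group(1).strip()
    return None


# Simulate a page whose chunk boundary falls inside the two-byte
# UTF-8 sequence for U+00E9 (e with acute accent).
page = u'<html><head><title>Caf\u00e9 menu</title></head><body></body></html>'.encode('utf-8')
cut = page.index(b'\xc3') + 1
title = find_title([page[:cut], page[cut:]])
print(title == u'Caf\u00e9 menu')
```

    With a real response object f from request.urlopen(url), the chunks could be supplied as iter(lambda: f.read(4096), b''), which keeps calling f.read(4096) until it returns an empty bytes object.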
