Scrape title by only downloading relevant part of webpage

深忆病人 2021-02-05 10:45

I would like to scrape just the title of a webpage using Python. I need to do this for thousands of sites, so it has to be fast. I've seen previous questions like retrieving jus

6 Answers
  • 2021-02-05 11:16

    Using urllib you can set the Range header to request a certain range of bytes, but there are some consequences:

    • it is up to the server whether to honor the request
    • you assume that the data you're looking for lies within the requested range (however, you can make another request with a different Range header to get the next bytes, i.e. download the first 300 bytes and fetch another 300 only if you can't find the title in the first result; two requests of 300 bytes are still much cheaper than the whole document)
    • (edit) to avoid situations where the title tag is split between two ranged requests, make your ranges overlap; see the 'range_header_overlapped' function in my example code

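      import urllib.request

      req = urllib.request.Request('http://www.python.org/')

      # request only the first bytes of the document
      req.headers['Range']='bytes=%s-%s' % (0, 300)

      f = urllib.request.urlopen(req)

      Just to verify whether the server accepted our range:

      # Content-Range is present only if the server honored the Range header
      content_range=f.headers.get('Content-Range')

      print(content_range)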
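
A server that honors the Range header normally answers with status 206 and a Content-Range header such as "bytes 0-300/49152"; one that ignores it answers 200 with the full body. Below is a minimal sketch of reading that header. `parse_content_range` is a hypothetical helper name, not part of urllib.

```python
import re

def parse_content_range(value):
    """Parse a header such as 'bytes 0-300/49152'.

    Returns (start, end, total); total is None when the server sends '*'.
    Returns None if the header is missing or not in this form.
    """
    if not value:
        return None
    m = re.match(r'bytes (\d+)-(\d+)/(\d+|\*)$', value.strip())
    if m is None:
        return None
    start, end, total = m.groups()
    return int(start), int(end), None if total == '*' else int(total)
```

With the response `f` from the snippet above, `parse_content_range(f.headers.get('Content-Range'))` returning None suggests that the server sent the whole document instead of the requested range.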

  • 2021-02-05 11:21

    You can defer downloading the entire response body by enabling the stream mode of requests.

    Requests 2.14.2 documentation - Advanced Usage

    By default, when you make a request, the body of the response is downloaded immediately. You can override this behaviour and defer downloading the response body until you access the Response.content attribute with the stream parameter:

    ...

    If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficiency with connections. If you find yourself partially reading request bodies (or not reading them at all) while using stream=True, you should consider using contextlib.closing (documented here)

    So, with this method, you can read the response chunk by chunk until you encounter the title tag. Since redirects are handled by the library, you'll be ready to go.

    Here's some error-prone code, tested with Python 2.7.10 and 3.6.0:

    try:
        from html import unescape            # Python 3.4+
    except ImportError:
        from HTMLParser import HTMLParser    # Python 2
        unescape = HTMLParser().unescape

    import re
    import requests
    from contextlib import closing

    CHUNKSIZE = 1024
    retitle = re.compile("<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
    buffer = ""
    with closing(requests.get("http://example.com/abc", stream=True)) as res:
        # without a known encoding, iter_content(decode_unicode=True) yields bytes
        if res.encoding is None:
            res.encoding = "utf-8"
        for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
            buffer = "".join([buffer, chunk])
            match = retitle.search(buffer)
            if match:
                # stop reading as soon as the complete title has arrived
                print(unescape(match.group(1)))
                break
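
The chunk-by-chunk search can be separated from the network code, which makes it easy to try without a live server. This is a sketch for Python 3 only. `find_title` and its `max_chars` cap are hypothetical additions, and the iterable of text chunks stands in for `res.iter_content(...)`.

```python
import re
from html import unescape

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

def find_title(chunks, max_chars=65536):
    """Scan text chunks in order and return the title, or None.

    Stops as soon as the title is found or max_chars have been buffered.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        match = TITLE_RE.search(buffer)
        if match:
            return unescape(match.group(1)).strip()
        if len(buffer) >= max_chars:
            break  # give up instead of downloading a huge page
    return None
```

Because the search always runs over the whole buffer, a title tag that is split across two chunks is still found once the second chunk arrives.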
    
  • 2021-02-05 11:21

    My code also handles cases where the title tag is split between chunks.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    Created on Tue May 30 04:21:26 2017
    ====================
    @author: s
    """

    import requests
    from html.parser import HTMLParser

    #proxies = { 'http': 'http://127.0.0.1:8080' }
    urls = ['http://opencvexamples.blogspot.com/p/learning-opencv-functions-step-by-step.html',
            'http://www.robindavid.fr/opencv-tutorial/chapter2-filters-and-arithmetic.html',
            'http://blog.iank.org/playing-capitals-with-opencv-and-python.html',
            'http://docs.opencv.org/3.2.0/df/d9d/tutorial_py_colorspaces.html',
            'http://scikit-image.org/docs/dev/api/skimage.exposure.html',
            'http://apprize.info/programming/opencv/8.html',
            'http://opencvexamples.blogspot.com/2013/09/find-contour.html',
            'http://docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html',
            'https://github.com/ArunJayan/OpenCV-Python/blob/master/resize.py']

    class TitleParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.match = False
            self.parts = []
            self.title = ''
        def handle_starttag(self, tag, attributes):
            if tag == 'title':
                # (re)start collecting; an overlapping range may replay the tag
                self.match = True
                self.parts = []
        def handle_data(self, data):
            if self.match:
                self.parts.append(data)
        def handle_endtag(self, tag):
            if tag == 'title' and self.match:
                # report the title only once </title> has been seen
                self.title = ''.join(self.parts)
                self.match = False

    def valid_content( url, proxies=None ):
        valid = [ 'text/html',
                  'application/xhtml+xml',
                  'application/xhtml',
                  'application/xml',
                  'text/xml' ]
        r = requests.head(url, proxies=proxies)
        # drop parameters such as "; charset=utf-8" and normalise the case
        our_type = r.headers.get('Content-Type', '').split(';')[0].strip().lower()
        if our_type not in valid:
            print('unknown content-type: {} at URL:{}'.format(our_type, url))
            return False
        return True

    def range_header_overlapped( chunksize, seg_num=0, overlap=50 ):
        """
        generate overlapping ranges
        (to solve cases when title tag splits between them)

        seg_num: segment number we want, 0 based
        overlap: number of overlapping bytes, defaults to 50
        """
        start = chunksize * seg_num
        end = chunksize * (seg_num + 1)
        if seg_num:
            overlap = overlap * seg_num
            start -= overlap
            end -= overlap
        return {'Range': 'bytes={}-{}'.format( start, end )}

    def get_title_from_url(url, proxies=None, chunksize=300, max_chunks=5):
        if not valid_content(url, proxies=proxies):
            return False
        current_chunk = 0
        myparser = TitleParser()
        while current_chunk <= max_chunks:
            headers = range_header_overlapped( chunksize, current_chunk )
            headers['Accept-Encoding'] = 'deflate'
            # quick fix, as my locally hosted Apache/2.4.25 kept raising
            # ContentDecodingError when using "Content-Encoding: gzip"
            # ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', 
            #                  error('Error -3 while decompressing: incorrect header check',))
            r = requests.get( url, headers=headers, proxies=proxies )
            # HTMLParser.feed() expects text, not bytes, on Python 3
            myparser.feed(r.text)
            if myparser.title:
                return myparser.title
            current_chunk += 1
        print('title tag not found within {} chunks ({}b each) at {}'.format(current_chunk-1, chunksize, url))
        return False
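
The overlapping ranges can also be generated on their own, which makes the arithmetic easier to check. A minimal sketch: `overlapping_ranges` and `range_header` are hypothetical names, and the ends are inclusive because that is how HTTP byte ranges are defined.

```python
def overlapping_ranges(chunksize, count, overlap=50):
    """Yield (start, end) byte offsets with inclusive ends.

    Each range begins `overlap` bytes before the previous one ended.
    """
    if not 0 <= overlap < chunksize:
        raise ValueError("overlap must be non-negative and smaller than chunksize")
    step = chunksize - overlap
    for n in range(count):
        start = n * step
        yield start, start + chunksize - 1

def range_header(start, end):
    """Build the request header for one inclusive byte range."""
    return {"Range": "bytes={}-{}".format(start, end)}
```

For example, `list(overlapping_ranges(300, 3))` gives `[(0, 299), (250, 549), (500, 799)]`, so consecutive requests share 50 bytes.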
    
  • 2021-02-05 11:24

    You're scraping webpages with standard HTTP requests, and I'm not aware of any request that returns only the title, so I don't think it's possible.

    I know this doesn't necessarily help get the title only, but I usually use BeautifulSoup for any web scraping. It's much easier. Here's an example.

    Code:

    import requests
    from bs4 import BeautifulSoup
    
    urls = ["http://www.google.com", "http://www.msn.com"]
    
    for url in urls:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, "html.parser")
    
        print("Title with tags: %s" % soup.title)
        print("Title: %s" % soup.title.text)
        print()
    

    Output:

    Title with tags: <title>Google</title>
    Title: Google
    
    Title with tags: <title>MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos &amp; Videos</title>
    Title: MSN.com - Hotmail, Outlook, Skype, Bing, Latest News, Photos & Videos
    
  • 2021-02-05 11:29

    I don't think the kind of thing you want can be done: the way the web is set up, you get the response to a request before anything is parsed. There usually isn't a streaming "if you encounter <title>, stop giving me data" flag. If there is, I'd love to see it. There is something that may help you, though. Keep in mind that not all sites respect it, so some sites will force you to download the entire page source before you can act on it, but a lot of them will let you specify a Range header. So, in a requests example:

    import requests
    
    targeturl = "http://www.urbandictionary.com/define.php?term=Blarg&page=2"
    rangeheader = {"Range": "bytes=0-150"}
    
    response = requests.get(targeturl, headers=rangeheader)
    
    response.text
    

    and you get

    '<!DOCTYPE html>\n<html lang="en-US" prefix="og: http://ogp.me/ns#'
    

    Now, of course, here are the problems with this: what if you specify a range that is too short to contain the title of the page? What's a good range to aim for (a balance of speed and assurance of accuracy)? What happens if the page doesn't respect Range? (Most of the time you just get the whole response you would have gotten without it.)
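
One way to approach both questions is to start with a small range and grow it, and to check the status code to see whether the range was honored. The sketch below is only an illustration: `range_sizes` and `usable_prefix` are hypothetical helpers, and the sizes 512 and 16384 are arbitrary assumptions, not measured values.

```python
def range_sizes(first=512, limit=16384):
    """Yield growing request sizes: first, 2*first, ... up to limit."""
    size = first
    while size <= limit:
        yield size
        size *= 2

def usable_prefix(status_code, body, wanted):
    """Return (prefix, honored).

    honored is True for 206 Partial Content. On a plain 200 the server
    ignored Range and sent everything, so the body is cut locally.
    """
    honored = status_code == 206
    return body[:wanted], honored
```

A caller would request `bytes=0-(size-1)` for each size in turn until the title shows up. Note that cutting the body locally after a 200 only bounds the parsing work; the bytes have already been transferred unless the request was streamed.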

    I don't know if this helps; I hope so. I've done similar things to fetch only file headers for download checking.

    EDIT4:

    So I thought of another, kind of hacky, thing that might help. Nearly every site has a "404 page not found" page, and we might be able to use this to our advantage: instead of requesting the regular page, request something like this.

    http://www.urbandictionary.com/nothing.php
    

    The regular page will have tons of information, links, and data, but the 404 page is nothing more than a message and (in this case) a video. Usually there is no video, just some text.

    But notice that the title still appears here. So perhaps we can just request something we know does not exist on any site, like

    X5ijsuUJSoisjHJFk948.php
    

    and get a 404 for each site. That way you download only a very small, minimalistic page, which significantly reduces the amount of data you download and thus increases speed and efficiency.

    Here's the problem with this method: you need to check somehow whether the site supplies its own version of the 404 page. Most sites have one because it fits the look of the site, and it's standard practice to include one, but not all of them do. Make sure to handle this case.

    But I think it could be worth trying. Over the course of thousands of sites, it would save many ms of download time for each page.

    EDIT5:

    So, as we talked about, since you are interested in URLs that redirect, we might make use of an HTTP HEAD request, which won't get the site content, just the headers. So in this case:

    response = requests.head('http://myshortenedurl.com/5b2su2')
    

    Replace myshortenedurl with tinyurl to follow along.

    >>>response
    <Response [301]>
    

    Nice, so we know this redirects to something.

    >>>response.headers['Location']
    'http://stackoverflow.com'
    

    Now we know where the URL redirects to without actually following it or downloading any page source, and we can apply any of the other techniques discussed previously.
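
A redirect's Location header may be relative, so it is safer to resolve it against the URL that was requested. Below is a small sketch of that step. `next_hop` and `REDIRECT_CODES` are hypothetical names, and `headers` can be any mapping, such as `response.headers` from requests.

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}

def next_hop(current_url, status_code, headers):
    """Return the URL a redirect response points to, or None if it is not a redirect."""
    location = headers.get("Location")
    if status_code not in REDIRECT_CODES or not location:
        return None
    # Location may be relative, so resolve it against the requested URL
    return urljoin(current_url, location)
```

Calling this in a loop after each `requests.head(url)` follows a chain of redirects without downloading any page body. Alternatively, `requests.head(url, allow_redirects=True)` lets the library follow the chain, and `response.url` then holds the final address.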

    Here's an example using the requests and lxml modules and the 404 page idea. (Be aware, I had to replace bit.ly with bit'ly so Stack Overflow doesn't get mad.)

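    #!/usr/bin/python3

    import requests
    from lxml.html import fromstring
    from urllib.parse import urljoin

    # bit.ly is written as bit'ly here (see the note above); restore the dot before running
    links = ["http://bit'ly/MW2qgH",
             "http://bit'ly/1x0885j",
             "http://bit'ly/IFHzvO",
             "http://bit'ly/1PwR9xM"]

    for link in links:

        redirect = link

        # follow redirects with HEAD requests until no Location header
        # comes back (capped at 10 hops to avoid redirect loops)
        for _ in range(10):
            response = requests.head(redirect)
            try:
                redirect = urljoin(redirect, response.headers['Location'])
            except KeyError:
                break

        # a path that should not exist on any site, to trigger the 404 page
        fakepage = urljoin(redirect, '/X5ijsuUJSoisjHJFk948.php')

        scrapetarget = requests.get(fakepage)
        tree = fromstring(scrapetarget.text)
        print(tree.findtext('.//title'))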
    

    So here we get the 404 pages, following redirects along the way. Here's the output:

    Urban Dictionary error
    Page Not Found - Stack Overflow
    Error 404 (Not Found)!!1
    Kijiji: Page Not Found
    

    So, as you can see, we did indeed get our titles, but we also see some problems with the method: some titles add things, and some just don't have a good title at all. That's the issue with this method. We could, however, try the range method too. The benefit is that the title would be correct, but sometimes we might miss it, and sometimes we have to download the whole page source to get it, increasing the time required.
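
Titles obtained through the 404 trick could be screened before they are trusted, falling back to the range method when they look like error pages. This is only a rough heuristic sketch. `looks_like_error_title` is a hypothetical helper, and the marker words are assumptions taken from the sample output above.

```python
import re

# Hypothetical markers, chosen from the sample output above; extend as needed.
ERROR_MARKERS = re.compile(r"\b(404|not found|error)\b", re.IGNORECASE)

def looks_like_error_title(title):
    """Return True when a title seems to describe an error page rather than the site."""
    return title is None or bool(ERROR_MARKERS.search(title))
```

All four titles printed above would be flagged by this check, which suggests the 404 trick rarely yields the real page title without some cleanup.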

    Also credit to alecxe for this part of my quick and dirty script

    tree = fromstring(scrapetarget.text)
    print(tree.findtext('.//title'))
    

    For an example with the range method, in the loop for link in links: change the code after the redirect loop (the try/except block) to this:

    rangeheader = {"Range": "bytes=0-500"}
    
    scrapetargetsection = requests.get(redirect, headers=rangeheader)
    tree = fromstring(scrapetargetsection.text)
    print(tree.findtext('.//title'))
    

    output is:

    None
    Stack Overflow
    Google
    Kijiji: Free Classifieds in...
    

    Here we see that Urban Dictionary has no title, or I've missed it in the bytes returned. All of these methods involve tradeoffs. I think the only way to get close to total accuracy would be to download the entire source of each page.

  • 2021-02-05 11:36

    Question: ... the only place I can optimize is likely to not read in the entire page.

    This does not read the entire page.

    Note: .decode() will raise an exception if a chunk cuts a multi-byte character sequence in the middle. Using .decode(errors='ignore') drops those incomplete sequences.

    For instance:

    import re
    try:
        # PY3
        from urllib import request
    except ImportError:
        # PY2
        import urllib2 as request

    # compiled once; matches as soon as the closing </title> has been read
    re_obj = re.compile(r'<title[^>]*>(.*?)</title>', re.DOTALL | re.IGNORECASE)

    for url in ['http://www.python.org/', 'http://www.google.com', 'http://www.bit.ly']:
        f = request.urlopen(url)
        Found = False
        data = ''
        while True:
            # read the body in 4 KB pieces instead of all at once
            b_data = f.read(4096)
            if not b_data: break

            data += b_data.decode(errors='ignore')
            match = re_obj.search(data)
            if match:
                Found = True
                title = match.group(1)
                print('title={}'.format(title))
                break

        f.close()
    

    Output:
    title=Welcome to Python.org
    title=Google
    title=Bitly | URL Shortener and Link Management Platform

    Tested with Python: 3.4.2 and 2.7.9
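
Regarding the note about cut multi-byte sequences: instead of dropping them, the standard library's incremental decoder can carry an incomplete sequence over to the next chunk. A minimal sketch for Python 3, assuming the page is UTF-8; `decode_chunks` is a hypothetical helper name.

```python
import codecs

def decode_chunks(byte_chunks, encoding="utf-8"):
    """Yield text for each byte chunk, keeping incomplete sequences for the next chunk."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    for chunk in byte_chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # flush whatever is still buffered at the end of the stream
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

In the loop above, each `f.read(4096)` result would be passed through one shared decoder, so a character split at a 4096-byte boundary is preserved rather than ignored.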
