Limiting/throttling the rate of HTTP requests in GRequests

盖世英雄少女心 2020-12-23 14:46

I'm writing a small script in Python 2.7.3 with GRequests and lxml that will allow me to gather some collectible card prices from various websites and compare them. Problem

4 answers
  • 2020-12-23 15:10

    Going to answer my own question, since I had to figure this out myself and there seems to be very little information about it.

    The idea is as follows. Every request object used with GRequests can take a session object as a parameter when it is created. Session objects, in turn, can have HTTP adapters mounted, and those adapters are used when making requests. By creating our own adapter we can intercept requests and rate-limit them in whatever way works best for our application. In my case I ended up with the code below.

    Object used for throttling:

    import datetime

    DEFAULT_BURST_WINDOW = datetime.timedelta(seconds=5)
    DEFAULT_WAIT_WINDOW = datetime.timedelta(seconds=15)
    
    
    class BurstThrottle(object):
        max_hits = None
        hits = None
        burst_window = None
        total_window = None
        timestamp = None
    
        def __init__(self, max_hits, burst_window, wait_window):
            self.max_hits = max_hits
            self.hits = 0
            self.burst_window = burst_window
            self.total_window = burst_window + wait_window
            self.timestamp = datetime.datetime.min
    
        def throttle(self):
            now = datetime.datetime.utcnow()
            if now < self.timestamp + self.total_window:
                if (now < self.timestamp + self.burst_window) and (self.hits < self.max_hits):
                    self.hits += 1
                    return datetime.timedelta(0)
                else:
                    return self.timestamp + self.total_window - now
            else:
                self.timestamp = now
                self.hits = 1
                return datetime.timedelta(0)
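
    To make the burst/wait behaviour easier to follow, here is a minimal sketch of the same logic with the clock passed in explicitly, so it can be stepped through by hand. The class name `SimpleBurstThrottle`, the `now` parameter, and the use of plain seconds instead of `timedelta` objects are assumptions made for illustration; this is not the code used in the adapter.

```python
class SimpleBurstThrottle(object):
    """Allow up to max_hits calls per burst window, then force a wait."""

    def __init__(self, max_hits, burst_window, wait_window):
        self.max_hits = max_hits
        self.burst_window = burst_window
        self.total_window = burst_window + wait_window
        self.window_start = None  # no window has started yet
        self.hits = 0

    def throttle(self, now):
        """Return the seconds to wait before proceeding (0.0 means go)."""
        if self.window_start is None or now >= self.window_start + self.total_window:
            # The previous window is over: start a new one with this call.
            self.window_start = now
            self.hits = 1
            return 0.0
        in_burst = now < self.window_start + self.burst_window
        if in_burst and self.hits < self.max_hits:
            self.hits += 1
            return 0.0
        # Burst exhausted or burst window over: wait for the next window.
        return self.window_start + self.total_window - now


throttle = SimpleBurstThrottle(max_hits=2, burst_window=5.0, wait_window=15.0)
waits = [throttle.throttle(t) for t in (0.0, 1.0, 2.0, 20.0)]
print(waits)  # [0.0, 0.0, 18.0, 0.0]
```

    The third call arrives inside the burst window but after both hits are used up, so it is told to wait until the whole 20-second window has elapsed.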
    

    HTTP adapter:

    import gevent
    import requests


    class MyHttpAdapter(requests.adapters.HTTPAdapter):
        throttle = None
    
        def __init__(self, pool_connections=requests.adapters.DEFAULT_POOLSIZE,
                     pool_maxsize=requests.adapters.DEFAULT_POOLSIZE, max_retries=requests.adapters.DEFAULT_RETRIES,
                     pool_block=requests.adapters.DEFAULT_POOLBLOCK, burst_window=DEFAULT_BURST_WINDOW,
                     wait_window=DEFAULT_WAIT_WINDOW):
            self.throttle = BurstThrottle(pool_maxsize, burst_window, wait_window)
            super(MyHttpAdapter, self).__init__(pool_connections=pool_connections, pool_maxsize=pool_maxsize,
                                                max_retries=max_retries, pool_block=pool_block)
    
        def send(self, request, stream=False, timeout=None, verify=True, cert=None, proxies=None):
            request_successful = False
            response = None
            while not request_successful:
                wait_time = self.throttle.throttle()
                while wait_time > datetime.timedelta(0):
                    gevent.sleep(wait_time.total_seconds(), ref=True)
                    wait_time = self.throttle.throttle()
    
                response = super(MyHttpAdapter, self).send(request, stream=stream, timeout=timeout,
                                                           verify=verify, cert=cert, proxies=proxies)
    
                if response.status_code != 429:
                    request_successful = True
    
            return response
    

    Setup:

    requests_adapter = adapter.MyHttpAdapter(
        pool_connections=__CONCURRENT_LIMIT__,
        pool_maxsize=__CONCURRENT_LIMIT__,
        max_retries=0,
        pool_block=False,
        burst_window=datetime.timedelta(seconds=5),
        wait_window=datetime.timedelta(seconds=20))
    
    requests_session = requests.session()
    requests_session.mount('http://', requests_adapter)
    requests_session.mount('https://', requests_adapter)
    
    unsent_requests = (grequests.get(url,
                                     hooks={'response': handle_response},
                                     session=requests_session) for url in urls)
    grequests.map(unsent_requests, size=__CONCURRENT_LIMIT__)
    
  • 2020-12-23 15:16

    I had a similar problem. Here's my solution. In your case, I would do:

    def worker():
        with rate_limit('slow.domain.com', 2):
            response = requests.get('https://slow.domain.com/path')
            text = response.text
        # Use `text`
    

    Assuming you are pulling from multiple domains, I would set up a dictionary mapping each domain to its delay so you don't hit your rate limits.
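
    A minimal sketch of such a mapping is shown here. The domain names reuse the examples from this answer, while `DOMAIN_DELAYS`, `DEFAULT_DELAY`, and `delay_for` are hypothetical names introduced only for illustration.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Hypothetical per-domain delays, in seconds.
DOMAIN_DELAYS = {
    'slow.domain.com': 2,
    'foo.bar.com': 1,
}
DEFAULT_DELAY = 0.5  # assumed fallback for domains not listed above


def delay_for(url):
    """Return the delay configured for the host name of `url`."""
    host = urlparse(url).hostname
    return DOMAIN_DELAYS.get(host, DEFAULT_DELAY)


print(delay_for('https://slow.domain.com/path'))  # 2
print(delay_for('https://example.org/'))          # 0.5
```

    A worker could then look up the host and its delay from the URL and pass both to `rate_limit` instead of hard-coding them.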

    The rate_limit implementation below assumes you are using gevent with monkey patching.

    from contextlib import contextmanager
    from time import time

    import gevent
    from gevent.event import Event
    from gevent.queue import Queue
    
    
    def rate_limit(resource, delay, _queues={}):
        """Delay use of `resource` until after `delay` seconds have passed.
    
        Example usage:
    
        def worker():
            with rate_limit('foo.bar.com', 1):
                response = requests.get('https://foo.bar.com/path')
                text = response.text
            # use `text`
    
        This will serialize and delay requests from multiple workers for resource
        'foo.bar.com' by 1 second.
    
        """
    
        if resource not in _queues:
            queue = Queue()
            gevent.spawn(_watch, queue)
            _queues[resource] = queue
    
        return _resource_manager(_queues[resource], delay)
    
    
    def _watch(queue):
        "Watch `queue` and wake event listeners after delay."
    
        last = 0
    
        while True:
            event, delay = queue.get()
    
            now = time()
    
            if (now - last) < delay:
                gevent.sleep(delay - (now - last))
    
            event.set()   # Wake worker but keep control.
            event.clear()
            event.wait()  # Yield control until woken.
    
            last = time()
    
    
    @contextmanager
    def _resource_manager(queue, delay):
        "`with` statement support for `rate_limit`."
    
        event = Event()
        queue.put((event, delay))
    
        event.wait() # Wait for queue watcher to wake us.
    
        yield
    
        event.set()  # Wake queue watcher.
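
    For comparison, the same "serialize and delay" idea can be sketched with plain threads and locks from the standard library instead of gevent. This is a different mechanism from the queue watcher above, not a drop-in replacement, and `rate_limit_threaded` and `demo` are hypothetical names used only for this sketch.

```python
import threading
import time
from contextlib import contextmanager

_registry_lock = threading.Lock()
_locks = {}      # resource name -> lock serializing its users
_last_used = {}  # resource name -> time the previous user finished


@contextmanager
def rate_limit_threaded(resource, delay):
    """Serialize use of `resource`, leaving `delay` seconds between uses."""
    with _registry_lock:
        lock = _locks.setdefault(resource, threading.Lock())
    with lock:
        last = _last_used.get(resource)
        if last is not None:
            remaining = last + delay - time.time()
            if remaining > 0:
                time.sleep(remaining)
        try:
            yield
        finally:
            # Measure the next delay from the end of this use.
            _last_used[resource] = time.time()


def demo(uses=3, delay=0.05):
    """Return the times at which each use of the resource started."""
    starts = []
    for _ in range(uses):
        with rate_limit_threaded('foo.bar.com', delay):
            starts.append(time.time())
    return starts


starts = demo()
print([round(b - a, 2) for a, b in zip(starts, starts[1:])])
```

    As in the gevent version, the delay is counted from the moment the previous user leaves the `with` block.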
    
  • 2020-12-23 15:27

    It doesn't look like there's any simple mechanism for handling this built into the requests or grequests code. The only hook that seems to be available is for responses.

    Here's a super hacky workaround to at least prove it's possible: I modified grequests to keep a list of the times at which requests were issued, and to delay the creation of each AsyncRequest until the request rate is below the maximum.

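    import time

    q = []  # module-level list of timestamps, one per issued request


    class AsyncRequest(object):
        def __init__(self, method, url, **kwargs):
            print self, 'init'
            # Block until fewer than 8 requests were issued in the last 15 seconds.
            waiting = True
            while waiting:
                if len([x for x in q if x > time.time() - 15]) < 8:
                    q.append(time.time())
                    waiting = False
                else:
                    print self, 'snoozing'
                    gevent.sleep(1)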
    

    You can use grequests.imap() to watch this interactively (rg below is my modified copy of grequests):

    import time
    import rg
    
    urls = [
            'http://www.heroku.com',
            'http://python-tablib.org',
            'http://httpbin.org',
            'http://python-requests.org',
            'http://kennethreitz.com',
            'http://www.cnn.com',
    ]
    
    def print_url(r, *args, **kwargs):
            print(r.url),time.time()
    
    hook_dict=dict(response=print_url)
    rs = (rg.get(u, hooks=hook_dict) for u in urls)
    for r in rg.imap(rs):
            print r
    

    I wish there was a more elegant solution, but so far I can't find one. Looked around in sessions and adapters. Maybe the poolmanager could be augmented instead?

    Also, I wouldn't put this code in production - the 'q' list never gets trimmed and would eventually get pretty big. Plus, I don't know if it's actually working as advertised. It just looks like it is when I look at the console output.
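
    One way to keep the list from growing is to drop expired timestamps before counting. The sketch below uses a `collections.deque` and the same numbers as above (8 requests per 15 seconds). `SlidingWindowLimiter` and `try_acquire` are hypothetical names, and the clock is passed in as `now` so the behaviour can be checked without waiting.

```python
from collections import deque


class SlidingWindowLimiter(object):
    """Allow at most max_requests within any window of `window` seconds."""

    def __init__(self, max_requests, window):
        self.max_requests = max_requests
        self.window = window
        self.timestamps = deque()

    def try_acquire(self, now):
        """Record a request at time `now` if allowed; return True on success."""
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False


limiter = SlidingWindowLimiter(max_requests=8, window=15)
results = [limiter.try_acquire(t) for t in range(10)]  # one attempt per second
print(results)                   # eight True values, then two False
print(limiter.try_acquire(15))   # True: the request from t=0 has expired
print(len(limiter.timestamps))   # 8
```

    A caller that gets False would sleep briefly (for example with gevent.sleep) and try again, as the loop above does.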

    Ugh. Just looking at this code I can tell it's 3am. Time to goto bed.

  • 2020-12-23 15:36

    Take a look at this for automatic requests throttling: https://pypi.python.org/pypi/RequestsThrottler/0.2.2

    You can either set a fixed delay between requests or set a number of requests to send in a fixed number of seconds (which is basically the same thing):

    import requests
    from requests_throttler import BaseThrottler
    
    request = requests.Request(method='GET', url='http://www.google.com')
    reqs = [request for i in range(0, 5)]  # An example list of requests
    with BaseThrottler(name='base-throttler', delay=1.5) as bt:
        throttled_requests = bt.multi_submit(reqs)
    

    Here, multi_submit returns a list of ThrottledRequest objects (see the documentation link at the end).

    You can then access the responses:

    for tr in throttled_requests:
        print tr.response
    

    Alternatively, you can achieve the same by specifying the number of requests to send in a fixed amount of time (e.g. 15 requests every 60 seconds):

    import requests
    from requests_throttler import BaseThrottler
    
    request = requests.Request(method='GET', url='http://www.google.com')
    reqs = [request for i in range(0, 5)]  # An example list of requests
    with BaseThrottler(name='base-throttler', reqs_over_time=(15, 60)) as bt:
        throttled_requests = bt.multi_submit(reqs)
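
    If the two options really are equivalent, as suggested above, then a rate can be converted to a fixed delay with simple arithmetic. The sketch below assumes requests are spaced evenly over the period, which is an assumption about the library's behaviour rather than something taken from its documentation. `delay_from_rate` is a hypothetical helper and not part of RequestsThrottler.

```python
def delay_from_rate(max_requests, period_seconds):
    """Return the evenly spaced delay, in seconds, for a given rate."""
    return float(period_seconds) / max_requests


print(delay_from_rate(15, 60))  # 4.0
```

    Under that assumption, reqs_over_time=(15, 60) would correspond to delay=4.0.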
    

    Both solutions can be implemented without the usage of the with statement:

    import requests
    from requests_throttler import BaseThrottler
    
    request = requests.Request(method='GET', url='http://www.google.com')
    reqs = [request for i in range(0, 5)]  # An example list of requests
    bt = BaseThrottler(name='base-throttler', delay=1.5)
    bt.start()
    throttled_requests = bt.multi_submit(reqs)
    bt.shutdown()
    

    For more details: http://pythonhosted.org/RequestsThrottler/index.html
