Multiple (asynchronous) connections with urllib2 or other http library?

Submitted by 别等时光非礼了梦想 on 2019-11-26 09:35:36

Question


I have code like this.

import urllib2

results = []
for p in range(1,1000):
    result = False
    while result is False:
        ret = urllib2.Request('http://server/?'+str(p))
        try:
            result = process(urllib2.urlopen(ret).read())
        except (urllib2.HTTPError, urllib2.URLError):
            pass
    results.append(result)

I would like to make two or three requests at the same time to speed this up. Can I use urllib2 for this, and how? If not, which other library should I use? Thanks.


Answer 1:


You can use asynchronous IO to do this.

requests + gevent = grequests

GRequests allows you to use Requests with Gevent to make asynchronous HTTP Requests easily.

import grequests

urls = [
    'http://www.heroku.com',
    'http://tablib.org',
    'http://httpbin.org',
    'http://python-requests.org',
    'http://kennethreitz.com'
]

rs = (grequests.get(u) for u in urls)
grequests.map(rs)



Answer 2:


Take a look at gevent — a coroutine-based Python networking library that uses greenlet to provide a high-level synchronous API on top of libevent event loop.

Example:

#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.

"""Spawn multiple workers and wait for them to complete"""

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2


def print_head(url):
    print 'Starting %s' % url
    data = urllib2.urlopen(url).read()
    print '%s: %s bytes: %r' % (url, len(data), data[:50])

jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.joinall(jobs)



Answer 3:


So, it's 2016 😉 and we have Python 3.4+ with the built-in asyncio module for asynchronous I/O (the async/await syntax used below requires Python 3.5+). We can use aiohttp as an HTTP client to download multiple URLs in parallel.

import asyncio
from aiohttp import ClientSession

async def fetch(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            return await response.read()

async def run(loop, r):
    url = "http://localhost:8080/{}"
    tasks = []
    for i in range(r):
        task = asyncio.ensure_future(fetch(url.format(i)))
        tasks.append(task)

    responses = await asyncio.gather(*tasks)
    # you now have all response bodies in this variable
    print(responses)

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run(loop, 4))
loop.run_until_complete(future)

Source: copy-pasted from http://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
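
The question asks for only two or three requests in flight at once, while gather() on its own starts every task immediately. One possible way to cap the concurrency is an asyncio.Semaphore. The following is a minimal sketch, not part of the original answer: fetch_all, fake_fetch, demo and the limit parameter are hypothetical names, and fake_fetch only stands in for a real HTTP coroutine such as the fetch() defined above. It assumes Python 3.7+ for asyncio.run.

```python
import asyncio


async def fetch_all(urls, fetch, limit=3):
    """Run fetch(url) for every URL with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # Only `limit` coroutines can hold the semaphore at the same time.
        async with semaphore:
            return await fetch(url)

    # gather() returns the results in the same order as `urls`.
    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP request coroutine."""
    await asyncio.sleep(0.01)
    return "body of " + url


def demo():
    urls = ["http://localhost:8080/{}".format(i) for i in range(4)]
    return asyncio.run(fetch_all(urls, fake_fetch, limit=2))
```

With aiohttp, the fetch() coroutine from this answer could be passed in place of fake_fetch; sharing one ClientSession across requests is usually preferable to opening one per URL.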




Answer 4:


I know this question is a little old, but I thought it might be useful to promote another async solution built on the requests library.

list_of_requests = ['http://moop.com', 'http://doop.com', ...]

from simple_requests import Requests
for response in Requests().swarm(list_of_requests):
    print response.content

The docs are here: http://pythonhosted.org/simple-requests/




Answer 5:


Either you figure out threads, or you use Twisted (which is asynchronous).
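
For the thread route, a small worker pool is probably enough for two or three simultaneous requests. Here is a rough sketch, assuming Python 3 (where urllib2 was split into urllib.request and urllib.error). The names fetch, fetch_all, fetch_one and max_workers are made up for illustration and do not come from the question.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.error import URLError
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (URLError, OSError):
        # HTTPError is a subclass of URLError, so it is covered here too.
        return None


def fetch_all(urls, fetch_one=fetch, max_workers=3):
    """Fetch every URL with at most `max_workers` requests in flight.

    Results come back in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, urls))
```

Calling fetch_all(['http://server/?%d' % p for p in range(1, 1000)]) would cover the range from the question. Unlike the original loop, this sketch does not retry failed requests; a failure simply yields None in the result list.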




Answer 6:


Maybe use multiprocessing and divide your work between two or so processes.

Here is an example (it's not tested)

import multiprocessing
import Queue
import urllib2


NUM_PROCESS = 2
NUM_URL = 1000


class DownloadProcess(multiprocessing.Process):
    """Download Process """

    def __init__(self, urls_queue, result_queue):

        multiprocessing.Process.__init__(self)

        self.urls = urls_queue
        self.result = result_queue

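    def run(self):
        while True:
            # Stop once no URL arrives within the timeout. A short timeout
            # is safer than get_nowait(), which can report Empty while
            # queued items are still being flushed to this process.
            try:
                url = self.urls.get(timeout=1)
            except Queue.Empty:
                break

            # Put exactly one result per URL (None on failure), so the
            # parent process knows how many results to wait for.
            try:
                result = urllib2.urlopen(urllib2.Request(url)).read()
            except (urllib2.HTTPError, urllib2.URLError):
                result = None

            self.result.put(result)


def main():

    main_url = 'http://server/?%s'

    urls_queue = multiprocessing.Queue()
    num_jobs = 0
    for p in range(1, NUM_URL):
        urls_queue.put(main_url % p)
        num_jobs += 1

    result_queue = multiprocessing.Queue()

    workers = []
    for i in range(NUM_PROCESS):
        download = DownloadProcess(urls_queue, result_queue)
        download.start()
        workers.append(download)

    # Read one result per URL. "while result_queue:" would never end,
    # because a Queue object is always truthy.
    results = []
    for _ in range(num_jobs):
        results.append(result_queue.get())

    for worker in workers:
        worker.join()

    return results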

if __name__ == "__main__":
    results = main()

    for res in results:
        print(res)


Source: https://stackoverflow.com/questions/4119680/multiple-asynchronous-connections-with-urllib2-or-other-http-library
