What is the fastest way to send 100,000 HTTP requests in Python?

暖寄归人 2020-11-22 07:12

I am opening a file which has 100,000 URLs. I need to send an HTTP request to each URL and print the status code. I am using Python 2.6, and so far I have looked at the many confusing ways Python handles threading/concurrency.

16 answers
  • 2020-11-22 07:25

    For your case, threading will probably do the trick, since you'll spend most of the time waiting for responses. The standard library's Queue module can help coordinate the work between threads.

    I did a similar thing with parallel downloading of files before and it was good enough for me, but it wasn't on the scale you are talking about.

    If your task were more CPU-bound, you might want to look at the multiprocessing module, which lets you use more CPUs/cores (separate processes don't block each other, since the locking is per process).

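    A minimal sketch of this thread-plus-Queue approach (using Python 3 module names, queue and http.client; on Python 2.6 the equivalents are Queue and httplib, and urls.txt and the worker count are assumptions for illustration):

    import queue
    import threading
    import http.client
    from urllib.parse import urlparse
    
    NUM_WORKERS = 100                    # assumed pool size
    task_queue = queue.Queue()
    
    def worker():
        # Each worker pulls URLs off the queue until it sees the sentinel.
        while True:
            url = task_queue.get()
            if url is None:
                break
            try:
                # Plain HTTP only, like the other answers here;
                # use HTTPSConnection for https URLs.
                parsed = urlparse(url)
                conn = http.client.HTTPConnection(parsed.netloc, timeout=10)
                conn.request("HEAD", parsed.path or "/")
                print(conn.getresponse().status, url)
                conn.close()
            except Exception as exc:
                print("error", url, exc)
            finally:
                task_queue.task_done()
    
    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    
    with open('urls.txt') as f:
        for line in f:
            task_queue.put(line.strip())
    
    task_queue.join()                    # wait until every URL is processed
    for _ in threads:
        task_queue.put(None)             # one sentinel per worker
    for t in threads:
        t.join()
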
  • 2020-11-22 07:26

    A solution using Twisted's thread pool (Python 2):

    from twisted.internet import reactor, threads
    from urlparse import urlparse
    import httplib
    import itertools
    
    
    concurrent = 200
    finished = itertools.count(1)
    reactor.suggestThreadPoolSize(concurrent)
    
    def getStatus(ourl):
        # Runs in a worker thread: issue a blocking HEAD request.
        url = urlparse(ourl)
        conn = httplib.HTTPConnection(url.netloc)
        conn.request("HEAD", url.path)
        res = conn.getresponse()
        return res.status
    
    def processResponse(response, url):
        print response, url
        processedOne()
    
    def processError(error, url):
        print "error", url  # , error
        processedOne()
    
    def processedOne():
        # Stop the reactor once every URL has been processed.
        if finished.next() == added:
            reactor.stop()
    
    def addTask(url):
        # Run the blocking request in Twisted's thread pool.
        req = threads.deferToThread(getStatus, url)
        req.addCallback(processResponse, url)
        req.addErrback(processError, url)
    
    added = 0
    for url in open('urllist.txt'):
        added += 1
        addTask(url.strip())
    
    try:
        reactor.run()
    except KeyboardInterrupt:
        reactor.stop()
    

    Test time:

    [kalmi@ubi1:~] wc -l urllist.txt
    10000 urllist.txt
    [kalmi@ubi1:~] time python f.py > /dev/null 
    
    real    1m10.682s
    user    0m16.020s
    sys 0m10.330s
    [kalmi@ubi1:~] head -n 6 urllist.txt
    http://www.google.com
    http://www.bix.hu
    http://www.godaddy.com
    http://www.google.com
    http://www.bix.hu
    http://www.godaddy.com
    [kalmi@ubi1:~] python f.py | head -n 6
    200 http://www.bix.hu
    200 http://www.bix.hu
    200 http://www.bix.hu
    200 http://www.bix.hu
    200 http://www.bix.hu
    200 http://www.bix.hu
    

    Ping times:

    bix.hu is ~10 ms away from me
    godaddy.com: ~170 ms
    google.com: ~30 ms
    
  • 2020-11-22 07:29

    A solution using the Tornado asynchronous networking library:

    from tornado import ioloop, httpclient
    
    i = 0
    
    def handle_request(response):
        print(response.code)
        global i
        i -= 1
        if i == 0:
            # All responses are in; stop the event loop.
            ioloop.IOLoop.instance().stop()
    
    http_client = httpclient.AsyncHTTPClient()
    for url in open('urls.txt'):
        i += 1
        http_client.fetch(url.strip(), handle_request, method='HEAD')
    ioloop.IOLoop.instance().start()
    
  • 2020-11-22 07:30

    Using a thread pool is a good option, and will make this fairly easy. Unfortunately, Python's standard library doesn't make thread pools particularly easy, but here is a decent third-party library that should get you started: http://www.chrisarndt.de/projects/threadpool/

    Code example from their site:

    # pip install threadpool; poolsize, some_callable, list_of_args and
    # callback are placeholders from the library's documentation.
    from threadpool import ThreadPool, makeRequests
    
    pool = ThreadPool(poolsize)
    requests = makeRequests(some_callable, list_of_args, callback)
    [pool.putRequest(req) for req in requests]
    pool.wait()
    

    Hope this helps.

  • 2020-11-22 07:31

    A good approach to solving this problem is to first write the code required to get one result, then incorporate threading code to parallelize the application.

    In a perfect world this would simply mean simultaneously starting 100,000 threads which output their results into a dictionary or list for later processing, but in practice you are limited in how many parallel HTTP requests you can issue this way. Locally, you are limited by how many sockets you can open concurrently and by how many threads of execution your Python interpreter will allow. Remotely, you may be limited in the number of simultaneous connections if all the requests go to one server, or to many. These limitations will probably mean writing the script so that it only polls a small fraction of the URLs at any one time (100, as another poster mentioned, is probably a decent thread-pool size, although you may find that you can successfully deploy many more).

    You can follow this design pattern to resolve the above issue:

    1. Start a thread which launches new request threads until the number of currently running threads (you can track them via threading.active_count() or by pushing the thread objects into a data structure) is >= your maximum number of simultaneous requests (say 100), then sleeps for a short timeout. This thread should terminate when there are no more URLs to process. Thus, the thread will keep waking up, launching new threads, and sleeping until you are finished.
    2. Have the request threads store their results in some data structure for later retrieval and output. If the structure you are storing the results in is a list or dict in CPython, you can safely append or insert unique items from your threads without locks, but if you write to a file or require more complex cross-thread data interaction, you should use a mutual-exclusion lock to protect this state from corruption.

    I would suggest you use the threading module. You can use it to launch and track running threads. Python's threading support is bare-bones, but the description of your problem suggests that it is completely sufficient for your needs.

    Finally, if you'd like to see a pretty straightforward example of a parallel network application written in Python, check out ssh.py. It's a small library which uses Python threading to parallelize many SSH connections. The design is close enough to your requirements that you may find it to be a good resource.
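
    A minimal sketch of the launcher-thread pattern described in steps 1 and 2 above (Python 3 names; urls.txt, the 100-thread cap and the use of urllib are assumptions for illustration, not part of the original answer):

    import threading
    import time
    import urllib.request
    
    MAX_CONCURRENT = 100                 # assumed cap on simultaneous requests
    results = []                         # shared structure for later output
    results_lock = threading.Lock()      # explicit lock, as discussed in step 2
    
    def fetch(url):
        # Request thread: get the status code and record it.
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                status = resp.status
        except Exception:
            status = "error"
        with results_lock:
            results.append((url, status))
    
    def launcher(urls):
        # Step 1: keep starting request threads while staying under the cap,
        # sleeping briefly whenever the cap is reached. active_count() also
        # counts the main and launcher threads, which is fine for a sketch.
        for url in urls:
            while threading.active_count() > MAX_CONCURRENT:
                time.sleep(0.1)
            threading.Thread(target=fetch, args=(url,)).start()
    
    with open('urls.txt') as f:
        urls = [line.strip() for line in f]
    
    control = threading.Thread(target=launcher, args=(urls,))
    control.start()
    control.join()
    while threading.active_count() > 1:  # wait for the last request threads
        time.sleep(0.1)
    
    for url, status in results:
        print(status, url)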

  • 2020-11-22 07:38

    1. Create an epoll object,
    2. open many client TCP sockets,
    3. adjust their send buffers to be a bit larger than the request header,
    4. send a request header (it should go through immediately, since it is just placed into the buffer) and register the socket with the epoll object,
    5. call .poll on the epoll object,
    6. read the first 3 bytes from each socket returned by .poll,
    7. write them to sys.stdout followed by \n (don't flush), then close the client socket.

    Limit the number of sockets opened simultaneously and handle the errors raised when sockets are created; create a new socket only after another one has been closed. Adjust the OS limits. Try forking into a few (not many) processes: this may help use the CPU a bit more effectively.

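    A rough sketch of this epoll approach (Linux-only, Python 3). The names urls.txt and MAX_OPEN are assumptions; DNS resolution and connection errors are handled only minimally, and instead of literally the first 3 bytes the sketch reads enough of the status line to print the status code:

    import select
    import socket
    import sys
    from urllib.parse import urlparse
    
    MAX_OPEN = 200                        # assumed cap on open sockets
    epoll = select.epoll()
    conns = {}                            # fd -> (socket, url)
    
    def open_socket(url):
        parsed = urlparse(url)
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.setblocking(False)
        try:
            # Note: the DNS lookup inside connect() still blocks.
            s.connect((parsed.hostname, parsed.port or 80))
        except BlockingIOError:
            pass                          # connect completes asynchronously
        # Wait for writability first, i.e. for the connection to be established.
        epoll.register(s.fileno(), select.EPOLLOUT)
        conns[s.fileno()] = (s, url)
    
    urls = iter(line.strip() for line in open('urls.txt'))
    for _ in range(MAX_OPEN):
        url = next(urls, None)
        if url is None:
            break
        open_socket(url)
    
    while conns:
        for fd, event in epoll.poll(1):
            s, url = conns[fd]
            if event & select.EPOLLIN:
                try:
                    data = s.recv(64)
                except OSError:
                    data = b""
                # "HTTP/1.0 200 OK" -> the status code starts at offset 9.
                status = data[9:12].decode(errors="replace") if data else "err"
                sys.stdout.write("{} {}\n".format(status, url))
            elif event & (select.EPOLLERR | select.EPOLLHUP):
                sys.stdout.write("err {}\n".format(url))
            else:
                # EPOLLOUT: connection established, the header fits in the buffer.
                path = urlparse(url).path or "/"
                host = urlparse(url).hostname
                req = "HEAD {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)
                s.send(req.encode())
                epoll.modify(fd, select.EPOLLIN)
                continue                  # keep this socket registered
            epoll.unregister(fd)
            s.close()
            del conns[fd]
            nxt = next(urls, None)        # keep the pool of open sockets full
            if nxt is not None:
                open_socket(nxt)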