What is the fastest way to send 100,000 HTTP requests in Python?

暖寄归人 2020-11-22 07:12

I am opening a file which has 100,000 URLs. I need to send an HTTP request to each URL and print the status code. I am using Python 2.6, and so far I have looked at the many confusing ways Python implements threading/concurrency.

16 answers
  • 2020-11-22 07:39

    Consider using Windmill, although Windmill probably can't do that many threads.

    You could do it with a hand-rolled Python script on 5 machines, each one connecting outbound using ports 40000-60000, so that together they open 100,000 connections.

    Also, it might help to do a sample test with a nicely threaded QA app such as OpenSTA in order to get an idea of how much each server can handle.

    Also, try looking into just using simple Perl with the LWP::ConnCache class. You'll probably get more performance (more connections) that way.

  • 2020-11-22 07:41

    The easiest way would be to use Python's built-in threading library. Its threads are limited by the GIL (CPU-bound work gets serialized), but for I/O-bound work like this they are good enough. You'd want a queue and a thread pool. One option is here, but it's trivial to write your own. You can't parallelize all 100,000 calls, but you can fire off 100 (or so) of them at the same time; a minimal sketch of that pattern follows below.
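
    A minimal sketch of that queue-and-thread-pool idea, using the stdlib thread pool in multiprocessing.dummy (available since Python 2.6). The file name 'urllist.txt' is only an assumption here, borrowed from the snippet in the answer below:

    from multiprocessing.dummy import Pool  # thread pool, despite the module name
    import urllib2

    def fetch_status(url):
        # A plain GET keeps the sketch short; HEAD would transfer less data.
        try:
            return url, urllib2.urlopen(url, timeout=5).getcode()
        except Exception as e:
            return url, repr(e)

    urls = [line.strip() for line in open('urllist.txt')]
    pool = Pool(100)  # roughly 100 requests in flight at a time
    for url, status in pool.imap_unordered(fetch_status, urls):
        print url, status
    pool.close()
    pool.join()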

  • 2020-11-22 07:42

    A Twisted-less solution (Python 2):

    from urlparse import urlparse
    from threading import Thread
    import httplib, sys
    from Queue import Queue
    
    concurrent = 200
    
    def doWork():
        while True:
            url = q.get()
            status, url = getStatus(url)
            doSomethingWithResult(status, url)
            q.task_done()
    
    def getStatus(ourl):
        try:
            url = urlparse(ourl)
            conn = httplib.HTTPConnection(url.netloc)
            conn.request("HEAD", url.path or "/")  # fall back to "/" for URLs with no path
            res = conn.getresponse()
            return res.status, ourl
        except Exception:  # a bare except would also swallow KeyboardInterrupt in the workers
            return "error", ourl
    
    def doSomethingWithResult(status, url):
        print status, url
    
    q = Queue(concurrent * 2)
    for i in range(concurrent):
        t = Thread(target=doWork)
        t.daemon = True
        t.start()
    try:
        for url in open('urllist.txt'):
            q.put(url.strip())
        q.join()
    except KeyboardInterrupt:
        sys.exit(1)
    

    This one is slightly faster than the Twisted solution and uses less CPU.
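
    If you're on Python 3, the same approach should work with the renamed stdlib modules (urlparse → urllib.parse, httplib → http.client, Queue → queue). A rough port, offered as a sketch rather than a drop-in replacement:

    from urllib.parse import urlparse
    from threading import Thread
    from queue import Queue
    import http.client
    import sys

    concurrent = 200

    def do_work():
        while True:
            ourl = q.get()
            status, ourl = get_status(ourl)
            print(status, ourl)
            q.task_done()

    def get_status(ourl):
        try:
            url = urlparse(ourl)
            conn = http.client.HTTPConnection(url.netloc, timeout=10)
            conn.request("HEAD", url.path or "/")
            return conn.getresponse().status, ourl
        except Exception:
            return "error", ourl

    q = Queue(concurrent * 2)
    for _ in range(concurrent):
        Thread(target=do_work, daemon=True).start()
    try:
        for line in open('urllist.txt'):
            q.put(line.strip())
        q.join()
    except KeyboardInterrupt:
        sys.exit(1)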

  • 2020-11-22 07:43

    Things have changed quite a bit since 2010, when this was posted. I haven't tried all the other answers, but I have tried a few, and I found this to work best for me using Python 3.6.

    I was able to fetch about 150 unique domains per second running on AWS.

    import pandas as pd
    import concurrent.futures
    import requests
    import time
    
    out = []
    CONNECTIONS = 100
    TIMEOUT = 5
    
    tlds = open('../data/sample_1k.txt').read().splitlines()
    urls = ['http://{}'.format(x) for x in tlds[1:]]
    
    def load_url(url, timeout):
        ans = requests.head(url, timeout=timeout)
        return ans.status_code
    
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        future_to_url = (executor.submit(load_url, url, TIMEOUT) for url in urls)
        time1 = time.time()
        for future in concurrent.futures.as_completed(future_to_url):
            try:
                data = future.result()
            except Exception as exc:
                data = str(type(exc))
            finally:
                out.append(data)
    
                print(str(len(out)),end="\r")
    
        time2 = time.time()
    
    print(f'Took {time2-time1:.2f} s')
    print(pd.Series(out).value_counts())
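
    One caveat and a possible tweak (my assumption, not something measured above): requests.head here opens a fresh connection for every call. If the same hosts appear more than once in your list, sharing a requests.Session with a larger connection pool lets the worker threads reuse keep-alive connections; with all-unique domains it won't help much. A hedged sketch:

    # Sketch: same ThreadPoolExecutor pattern, but requests share one Session.
    # Note: sharing a Session across threads is common practice, though the
    # requests docs do not formally guarantee thread-safety.
    import concurrent.futures
    import requests
    from requests.adapters import HTTPAdapter

    CONNECTIONS = 100
    TIMEOUT = 5
    urls = ['http://example.com', 'http://example.org']  # placeholder list

    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=CONNECTIONS, pool_maxsize=CONNECTIONS)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    def load_url(url, timeout):
        return session.head(url, timeout=timeout).status_code

    with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
        futures = [executor.submit(load_url, url, TIMEOUT) for url in urls]
        for future in concurrent.futures.as_completed(futures):
            try:
                print(future.result())
            except Exception as exc:
                print(type(exc).__name__)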
    