Node.js: Async requests with a list of URLs

南旧 2020-12-02 02:50

I am working on a crawler. I have a list of URLs that need to be requested. There will be several hundred requests at the same time if I don't set it to be async. I am afraid that

2 answers
  • 2020-12-02 03:33

    You can use the setTimeout function to process all the requests within a loop. For that, you must know the maximum time it takes to process a request.
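
    A minimal sketch of that idea, assuming a fixed delay between request starts instead of a known maximum request time. The names staggerRequests, delayMs and fn are hypothetical; fn is any function that returns a promise for one URL. This only spaces out the starts, so it does not cap how many requests are in flight when responses are slow.

```javascript
// Hypothetical helper: start one request every delayMs milliseconds.
// fn(url) should return a promise; failures are collected, not thrown.
function staggerRequests(urls, delayMs, fn) {
    const jobs = urls.map((url, i) => new Promise(resolve => {
        setTimeout(() => {
            // Promise.resolve().then(...) also captures synchronous throws from fn
            Promise.resolve()
                .then(() => fn(url))
                .then(value => resolve({ url, value }))
                .catch(error => resolve({ url, error }));
        }, i * delayMs);
    }));
    // resolves with one { url, value } or { url, error } entry per URL, in input order
    return Promise.all(jobs);
}
```

    For example, staggerRequests(urlList, 200, fn) would start a new request every 200 ms.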

  • 2020-12-02 03:39

    The things you need to watch for are:

    1. Whether the target site has rate limiting, in which case you may be blocked if you request too much too fast.

    2. How many simultaneous requests the target site can handle without degrading its performance.

    3. How much bandwidth your server has on its end of things.

    4. How many simultaneous requests your own server can have in flight and process without causing excess memory usage or a pegged CPU.

    In general, the scheme for managing all this is to create a way to tune how many requests you launch. There are many different ways to control this: by number of simultaneous requests, number of requests per second, amount of data used, etc.

    The simplest way to start would be to just control how many simultaneous requests you make. That can be done like this:

    function runRequests(arrayOfData, maxInFlight, fn) {
        return new Promise((resolve, reject) => {
            let index = 0;
            let inFlight = 0;
    
            function next() {
                while (inFlight < maxInFlight && index < arrayOfData.length) {
                    ++inFlight;
                    fn(arrayOfData[index++]).then(result => {
                        --inFlight;
                        next();
                    }).catch(err => {
                        --inFlight;
                        console.log(err);
                        // purposely eat the error and let the rest of the processing continue
                        // if you want to stop further processing, you can call reject() here
                        next();
                    });
                }
                if (inFlight === 0) {
                    // all done
                    resolve();
                }
            }
            next();
        });
    }
    

    And, then you would use that like this:

    const rp = require('request-promise');
    
    // run the whole urlList, no more than 10 at a time
    runRequests(urlList, 10, function(url) {
        return rp(url).then(function(data) {
            // process fetched data here for one url
        }).catch(function(err) {
            console.log(url, err);
        });
    }).then(function() {
        // all requests done here
    });
    

    This can be made as sophisticated as you want by adding a time element to it (no more than N requests per second) or even a bandwidth element to it.
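
    As a rough sketch of the time element, the variant below keeps the maxInFlight cap and also starts no more than maxPerSecond requests in any one-second window. The name runRequestsRateLimited and the maxPerSecond parameter are hypothetical, and the sketch assumes maxPerSecond is at least 1. Errors are logged and skipped, as in runRequests above.

```javascript
// Hypothetical variant of runRequests: at most maxInFlight concurrent calls
// and at most maxPerSecond calls started within any one-second window.
// Assumes maxPerSecond >= 1 and that fn(item) returns a promise.
function runRequestsRateLimited(arrayOfData, maxInFlight, maxPerSecond, fn) {
    return new Promise(resolve => {
        let index = 0;
        let inFlight = 0;
        let timerPending = false;
        const startTimes = [];   // start timestamps (ms) within the last second

        function next() {
            const now = Date.now();
            // forget starts that are at least one second old
            while (startTimes.length > 0 && now - startTimes[0] >= 1000) {
                startTimes.shift();
            }
            while (inFlight < maxInFlight && index < arrayOfData.length &&
                   startTimes.length < maxPerSecond) {
                const item = arrayOfData[index++];
                ++inFlight;
                startTimes.push(now);
                Promise.resolve()
                    .then(() => fn(item))
                    .catch(err => console.log(err))   // log the error and keep going
                    .then(() => { --inFlight; next(); });
            }
            if (index >= arrayOfData.length) {
                if (inFlight === 0) {
                    resolve();   // all done
                }
            } else if (inFlight < maxInFlight && !timerPending) {
                // held back by the rate limit: retry when the oldest start expires
                timerPending = true;
                setTimeout(() => {
                    timerPending = false;
                    next();
                }, 1000 - (now - startTimes[0]));
            }
        }
        next();
    });
}
```

    For example, runRequestsRateLimited(urlList, 10, 5, fn) would allow up to 10 requests in flight but start no more than 5 per second.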

    "I want one request is called after one request is completed."

    That's a very slow way to do things. If you really want that, then you can just pass a 1 for the maxInFlight parameter to the above function, but typically, things would work a lot faster and not cause problems by allowing somewhere between 5 and 50 simultaneous requests. Only testing would tell you where the sweet spot is for your particular target sites and your particular server infrastructure and amount of processing you need to do on the results.
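
    For the strictly one-at-a-time case, a plain async/await loop is an alternative to passing 1 for maxInFlight. This is a minimal sketch, assuming a Node.js version that supports async functions; runSequentially is a hypothetical name and fn is any function that returns a promise for one item.

```javascript
// Hypothetical helper: process the list strictly one item at a time.
async function runSequentially(arrayOfData, fn) {
    for (const item of arrayOfData) {
        try {
            // the next item is not started until this promise settles
            await fn(item);
        } catch (err) {
            // log the error and continue with the next item
            console.log(err);
        }
    }
}
```

    It would be called as runSequentially(urlList, fn), with the same kind of fn that is passed to runRequests.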
