Typical politeness factor for a web crawler?

甜味超标  2021-02-01 10:44

What is a typical politeness factor for a web crawler?

Apart from always obeying robots.txt, both the "Disallow:" and the non-standard "Crawl-delay:" directives.

But if a site does not specify a crawl delay, what is a reasonable default?
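
For reference, pulling the non-standard "Crawl-delay:" value out of an already-fetched robots.txt body can be sketched roughly as below; the function name is made up for illustration, and per-user-agent scoping is deliberately ignored to keep the sketch short:

    #include <iostream>
    #include <sstream>
    #include <string>

    // Illustrative sketch only: return the Crawl-delay value in seconds,
    // or 0.0 if the directive is absent. A real parser must also respect
    // the User-agent section the directive appears under.
    double crawlDelaySeconds(const std::string& robotsTxt)
    {
        const std::string key = "Crawl-delay:";
        std::istringstream lines(robotsTxt);
        std::string line;
        double delay = 0.0;
        while (std::getline(lines, line)) {
            auto pos = line.find(key);
            if (pos != std::string::npos) {
                try {
                    delay = std::stod(line.substr(pos + key.size()));
                } catch (...) { /* malformed value: ignore it */ }
            }
        }
        return delay;
    }

    int main()
    {
        const std::string robotsTxt = "User-agent: *\nDisallow: /private/\nCrawl-delay: 10\n";
        std::cout << crawlDelaySeconds(robotsTxt) << " seconds\n";   // prints "10 seconds"
    }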

1 Answer
    -上瘾入骨i  2021-02-01 11:26

    The algorithm we use is:

    // If we are blocked by robots.txt
    // make sure it is obeyed.
    // Our bot's user-agent string contains a link to an HTML page explaining this,
    // and an email address site owners can write to so that we never even consider their domain in the future.
    
    // If we receive more than 5 consecutive responses with an HTTP response code of 500+ (or timeouts)
    // then we assume the domain is either under heavy load and does not need us adding to it,
    // or the URLs we are crawling are completely wrong and causing problems.
    // Either way we suspend crawling of this domain for 4 hours.
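    //
    // A minimal sketch of that suspension rule (consecutiveServerErrors and
    // suspendDomain() are illustrative names, not necessarily what the real crawler uses):
    //
    //    if (++consecutiveServerErrors >= 5) {
    //        suspendDomain(domain, std::chrono::hours(4));   // back off for 4 hours
    //        consecutiveServerErrors = 0;
    //    }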
    
    // There is a non-standard parameter in robots.txt that defines a min crawl delay
    // If it exists then obey it.
    //
    //    see: http://www.searchtools.com/robots/robots-txt-elements.html
    double politenessFromRobotsTxt = getRobotPolitness();
    
    
    // Work Size politeness
    // Large popular domains are designed to handle load, so we can use a
    // smaller delay on these sites than for smaller domains (thus mom-and-pop
    // sites hosted on the family PC under the desk in the office are crawled slowly).
    //
    // But the max delay here is 5 seconds:
    //
    //    domainSize => Range 0 -> 10
    //
    double workSizeTime = std::min(exp(2.52166863221 - 0.530185027289 * log(domainSize)), 5.0);
    //
    // You can find out how important we think your site is here:
    //      http://www.opensiteexplorer.org
    // Look at the Domain Authority and divide by 10.
    // Note: This is not exactly the number we use but the two numbers are highly correlated
    //       Thus it will usually give you a fair indication.
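    //
    // Worked examples (derived from the formula above): for domainSize = 10 the
    // delay is exp(2.5217 - 0.5302 * ln(10)) ≈ 3.7 seconds; for domainSize = 1 the
    // raw value is ≈ 12.4 seconds, which the std::min() above clamps down to 5 seconds.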
    
    
    
    // Take into account the response time of the last request.
    // If the server is under heavy load and taking a long time to respond
    // then we slow down the requests. Note: time-outs are handled above.
    double responseTime = pow(0.203137637588 + 0.724386103344 * lastResponseTime, 2);
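    //
    // Worked examples (derived from the formula above): lastResponseTime = 1.0 s
    // gives (0.2031 + 0.7244)^2 ≈ 0.86 s, while lastResponseTime = 5.0 s gives ≈ 14.6 s.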
    
    // Use the slower of the calculated times
    double result = std::max(workSizeTime, responseTime);
    
    // Never crawl faster than the crawl-delay directive allows
    result = std::max(result, politenessFromRobotsTxt);
    
    
    // Set a minimum delay
    // so we never hit a site more often than once every 10th of a second.
    result = std::max(result, 0.1);
    
    // The maximum delay we use is 2 minutes (120 seconds).
    result = std::min(result, 120.0);
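
    Working the numbers above through for one concrete case: with domainSize = 10, a
    lastResponseTime of 1 second, and no Crawl-delay directive (taking politenessFromRobotsTxt
    as 0), workSizeTime ≈ 3.7 s and responseTime ≈ 0.86 s, so after the min/max clamps the
    crawler would wait roughly 3.7 seconds between requests to that domain.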
    
