Getting past a request limit when crawling a web site

Asked by 盖世英雄少女心 · 2021-02-03 11:33

I'm working on a web crawler that indexes sites that don't want to be indexed.

My first attempt: I wrote a C# crawler that goes through each and every page and downloads …

4 Answers
  •  走了就别回头了 · 2021-02-03 11:45

    OK, first and foremost: if a website doesn't want you to crawl it too often, then you shouldn't! It's basic politeness, and you should always try to adhere to it.

    However, I do understand that there are some websites, like Google, that make their money by crawling your website all day long, yet when you try to crawl Google, they block you.

    Solution 1: Proxy Servers

    In any case, the alternative to getting a bunch of EC2 machines is to get proxy servers. Proxy servers are MUCH cheaper than EC2; case in point: http://5socks.net/en_proxy_socks_tarifs.htm

    Of course, proxy servers are not as fast as EC2 (bandwidth-wise), but you should be able to strike a balance where you get similar or higher throughput than your 50 EC2 instances for substantially less than what you're paying now. That means shopping around for affordable proxies and testing which ones give you comparable results.

    One thing to note: just like you, other people may be using the same proxy service to crawl the site you're crawling, and they may not be as careful about how they do it, so the whole proxy service can get blocked because of some other client's activity (I've personally seen it happen).
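
    Putting the proxy idea into practice mostly comes down to routing each request through a different member of the pool. Below is a minimal C# sketch of that rotation, assuming you already have a list of proxy addresses from whichever provider you pick; the ProxyRotator class, the addresses, and the example URL are all placeholders, and a real crawler would still need error handling and per-proxy throttling.

        using System;
        using System.Collections.Generic;
        using System.Net;
        using System.Net.Http;
        using System.Threading.Tasks;

        // Rotates outgoing requests across a pool of proxies so that no single
        // IP address hits the target site's request limit.
        class ProxyRotator
        {
            private readonly List<string> _proxies;
            private int _next;

            public ProxyRotator(IEnumerable<string> proxies)
            {
                _proxies = new List<string>(proxies);
            }

            public HttpClient NextClient()
            {
                // Simple round-robin over the pool; a real crawler should also
                // drop proxies that start returning errors or CAPTCHA pages.
                string address = _proxies[_next++ % _proxies.Count];
                var handler = new HttpClientHandler
                {
                    Proxy = new WebProxy(address),
                    UseProxy = true
                };
                return new HttpClient(handler, disposeHandler: true);
            }
        }

        class Program
        {
            static async Task Main()
            {
                var rotator = new ProxyRotator(new[]
                {
                    "http://203.0.113.10:8080",  // placeholder addresses -- use
                    "http://203.0.113.11:8080",  // whatever your provider gives you
                    "socks5://203.0.113.12:1080" // socks5:// requires .NET 6 or later
                });

                using HttpClient client = rotator.NextClient();
                string html = await client.GetStringAsync("https://example.com/some-page");
                Console.WriteLine($"Fetched {html.Length} characters");
            }
        }

    Each NextClient() call here builds a fresh HttpClient, which is fine for a sketch; for a long-running crawler you would normally cache one client per proxy instead of recreating the handler on every request.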

    Solution 2: You-Da-Proxy!

    This is a little crazy and I haven't done the math on it, but you could start a proxy service yourself and sell proxy access to others. You can't use all of your EC2 machines' bandwidth anyway, so the best way to cut costs is to do what Amazon does: sub-lease the hardware.
