I decided to host my crawler on Amazon Web Services, since it offers both SQS for queues and auto-scalable instances. It also has S3, where I can store all my images.
I also decided to rewrite my whole crawler in Python instead of PHP, to more easily take advantage of things like queues and to keep the app running 100% of the time instead of relying on cronjobs.
So what I did, and what it means
I set up an Elastic Beanstalk application for my crawler, configured as a "Worker" and listening to an SQS queue where I store all the domains that need to be crawled. SQS is a queue service: I push each domain that needs to be crawled onto the queue, and the crawler listens to it and fetches one domain at a time until the queue is empty. There is no need for cronjobs or anything like that; as soon as data lands in the queue, it is sent to the crawler. That means the crawler is up 100% of the time, 24/7.
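To make this concrete, here is a minimal sketch of what the worker side can look like, assuming a Flask app (my actual code may differ). In an Elastic Beanstalk worker environment, the SQS daemon delivers each queue message to the application over HTTP POST, and a 200 response tells it the message is handled. The `crawl_domain` helper and the queue URL are placeholders for illustration, not my real setup.

```python
import boto3
from flask import Flask, request

application = Flask(__name__)  # Elastic Beanstalk looks for an object named "application"

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/crawl-queue"  # placeholder


def enqueue_domain(domain):
    # Push one domain onto the queue; the worker environment delivers it to the app below.
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=domain)


@application.route("/", methods=["POST"])
def handle_message():
    # The SQS daemon POSTs the raw message body (one domain) to this endpoint.
    domain = request.get_data(as_text=True).strip()
    crawl_domain(domain)
    return "", 200  # 200 = message done; any error response makes it retry


def crawl_domain(domain):
    # Placeholder for the actual crawl logic (fetch pages, extract images, etc.)
    print(f"Crawling {domain}")
```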
The application is set to auto scale, meaning that when there are too many domains in the queue, it spins up a second, third, fourth... instance/crawler to speed up the process. I think this is a very, very important point for anyone who wants to set up a crawler.
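The number that matters for scaling is the backlog sitting in the queue. The actual scaling rules live in the Elastic Beanstalk / Auto Scaling configuration rather than in application code, but as an illustration, this is the figure a scaling decision can key off (queue URL and region are placeholders):

```python
import boto3

sqs = boto3.client("sqs", region_name="eu-west-1")  # region is an assumption
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/crawl-queue"  # placeholder


def queue_backlog():
    # ApproximateNumberOfMessages = domains still waiting to be crawled.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    return int(attrs["Attributes"]["ApproximateNumberOfMessages"])


print(f"Domains waiting to be crawled: {queue_backlog()}")
```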
- All images are saved to an S3 bucket. This means the images are not stored on the crawler's server and can easily be fetched and worked with (see the sketch below).
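A minimal sketch of what that looks like, assuming the crawler downloads an image with `requests` and writes it straight to S3 instead of local disk; the bucket name and key layout are made-up examples, not my real setup:

```python
import boto3
import requests

s3 = boto3.client("s3")
BUCKET = "my-crawler-images"  # placeholder bucket name


def save_image_to_s3(image_url, key):
    # Download the image and store it in S3 under the given key, never touching local disk.
    resp = requests.get(image_url, timeout=10)
    resp.raise_for_status()
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=resp.content,
        ContentType=resp.headers.get("Content-Type", "image/jpeg"),
    )


save_image_to_s3("https://example.com/logo.png", "example.com/logo.png")
```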
The results have been great. When I had a PHP crawler running on cronjobs every 15 minutes, I could crawl about 600 URLs per hour. Now I can easily crawl 10,000+ URLs per hour, and even more depending on how I configure the auto scaling.