Question
We need to download metadata for all iOS apps on a daily basis. We plan on extracting the information by crawling the iTunes website and by using the iTunes search API. Since there are 700K+ apps, we need an efficient way to do this.
One approach is to set up a bunch of scripts on EC2 and run them in parallel. Before we embark down this path, are there services like 80legs that people have used to accomplish a similar task? Essentially, we want something to help us crawl hundreds of thousands of pages (or make a bunch of API calls) very fast.
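A minimal sketch of the "bunch of parallel scripts" approach, using the public iTunes Lookup API (https://itunes.apple.com/lookup) with batched, comma-separated app IDs. The batch size, worker count, and example IDs are illustrative assumptions, not tested limits, and Apple rate-limits this API, so a real crawler would also need throttling and retries:

```python
"""Sketch: fetch iOS app metadata in parallel via the iTunes Lookup API."""
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

LOOKUP_URL = "https://itunes.apple.com/lookup?id={ids}"


def fetch_batch(app_ids):
    """Fetch metadata for one comma-separated batch of numeric app IDs."""
    url = LOOKUP_URL.format(ids=",".join(str(i) for i in app_ids))
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp).get("results", [])


def crawl(all_ids, batch_size=100, workers=8):
    """Split the full ID list into batches and fetch the batches in parallel."""
    batches = [all_ids[i:i + batch_size]
               for i in range(0, len(all_ids), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for results in pool.map(fetch_batch, batches):
            yield from results


if __name__ == "__main__":
    # Example numeric app IDs; a real run would iterate over the full catalogue.
    for app in crawl([284882215, 389801252]):
        print(app.get("trackId"), app.get("trackName"))
```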
Answer 1:
You might want to look into Apple's Enterprise Partner Feed (EPF). It will probably be much cheaper than running a bunch of EC2 machines or building the crawling infrastructure to scrape the data. From the EPF description itself:
The Enterprise Partner Feed is a data feed of the complete set of metadata from iTunes and the App Store. It is available for affiliate partners to fully incorporate aspects of the iTunes and App Store catalogs into a web site or app.
EPF has two feed modes:

iTunes generates the EPF data in two modes:
- full mode
- incremental mode

The full export is generated weekly and contains a complete snapshot of iTunes metadata as of the day of generation. The incremental export is generated daily and contains records that have been added or modified since the last full export. The incremental exports are located relative to the full export on which they are based.
Obviously, you'd use the full mode to populate your dataset initially, then use the incremental exports for the daily updates.
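A minimal sketch of that full-then-incremental pattern, assuming the EPF flat-file layout Apple documents (fields separated by ASCII 0x01, records terminated by ASCII 0x02 plus a newline, comment lines prefixed with `#`). The file names and key-column index are hypothetical; verify them against the current EPF documentation before relying on this:

```python
"""Sketch: build a catalogue from a weekly full EPF export, then apply dailies."""

FIELD_SEP = "\x01"      # EPF field separator (ASCII SOH)
RECORD_SEP = "\x02\n"   # EPF record terminator (ASCII STX + newline)


def parse_epf(path):
    """Yield records (lists of field strings) from one EPF flat file.

    Reads the whole file into memory for simplicity; a streaming parser
    would be preferable for the multi-gigabyte full export.
    """
    with open(path, encoding="utf-8", newline="") as f:
        for record in f.read().split(RECORD_SEP):
            if not record or record.startswith("#"):
                continue  # skip the trailing empty chunk and comment/header lines
            yield record.split(FIELD_SEP)


def load_catalogue(full_path, incremental_paths, key_index=0):
    """Load the weekly full export, then let daily incrementals overwrite it."""
    catalogue = {}
    for record in parse_epf(full_path):
        catalogue[record[key_index]] = record
    for path in incremental_paths:
        for record in parse_epf(path):
            catalogue[record[key_index]] = record
    return catalogue


if __name__ == "__main__":
    # Hypothetical local file names after downloading and decompressing the feeds.
    apps = load_catalogue("application_full", ["application_incremental_day1"])
    print(len(apps), "applications loaded")
```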
Good luck.
Source: https://stackoverflow.com/questions/14988664/fastest-service-for-crawling-web-pages-or-invoking-apis-itunes-in-particular