I have a site with about 150K pages in its sitemap. I\'m using the sitemap index generator to make the sitemaps, but really, I need a way of caching it, because building the 150
Okay - I have found some more info on this and what amazon are doing with their 6 million or so URLS.
Amazon simply make a map for each day and add to it:
So this means that they end up with loads of site-maps - but the search bot will only look at the latest ones - as the updated dates are recent. I was under the understanding that one should refresh a map - and not include a url more than once. I think this is true. But, Amazon get around this as the site maps are more of a log. A url may appear in a later site-map - as it maybe updated - but Google wont look at the older maps as they are out of date - unless of course it does a major re-index. This approach makes a lot of sense as all you do is simply build a new map - say each day of new and updated content and ping it at google - thus google only needs to index these new urls.
This log approach is a synch to code - as all you need is a static data-store model that stores the XML data for each map. your cron job can build a map - daily or weekly and then store the raw XML page in a blob field or what have you. you can then serve the pages straight from a handler and also the index map too.
I'm not sure what others think but this sounds like a very workable approach and a load off ones server - compared to rebuilding huge map just because a few pages may have changed.
I have also considered that it may be possible to then crunch a weeks worth of maps into a week map and 4 weeks of maps into a month - so you end up with monthly maps, a map for each week in the current month and then a map for the last 7 days. Assuming that the dates are all maintained this will reduce the number of maps tidy up the process - im thinking in terms of reducing 365 maps for each day of the year down to 12.
Here is a pdf on site maps and the approaches used by amazon and CNN.
http://www.wwwconference.org/www2009/proceedings/pdf/p991.pdf