I'm scraping data from the web, and I have several processes of my scraper running in parallel.
I want the output of each of these processes to end up in the same file.
One possibly interesting thing you could do is use GNU Parallel: http://www.gnu.org/s/parallel/ For example, if you were spidering the sites:
stackoverflow.com, stackexchange.com, fogcreek.com
you could do something like this:
(echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k your_spider_script
and the output is buffered by parallel and, because of the -k option, returned to you in the order of the site list above. A real example (basically copied from the second parallel screencast):
~ $ (echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k ping -c 1 {}
PING stackoverflow.com (64.34.119.12): 56 data bytes
--- stackoverflow.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING stackexchange.com (64.34.119.12): 56 data bytes
--- stackexchange.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING fogcreek.com (64.34.80.170): 56 data bytes
64 bytes from 64.34.80.170: icmp_seq=0 ttl=250 time=23.961 ms
--- fogcreek.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 23.961/23.961/23.961/0.000 ms
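Applied to your scraping case, a minimal sketch along the same lines (sites.txt and combined_output.txt are placeholder names here, and your_spider_script stands in for whatever command fetches one site and prints to stdout):

~ $ parallel -k your_spider_script {} < sites.txt > combined_output.txt

Because parallel buffers each job's output until that job finishes, and -k keeps the output in input order, the results from the different processes won't interleave in combined_output.txt even though the jobs run concurrently.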
Anyway, ymmv