I'm scraping data from the web, and I have several processes of my scraper running in parallel.
I want the output of each of these processes to end up in the same file. Can I just redirect them all to it with '>>', or will lines from different processes get mixed up with each other?
Definitely not. I had a log-management script where I assumed this worked, and it did work, until I moved it to a production server under load. Not a good day... Basically, you sometimes end up with completely mixed-up lines.
If I'm trying to capture from multiple sources, it is much simpler (and easier to debug) to keep a separate 'paper trail' file per source, and if I need an overall log file, to concatenate them based on timestamps (you are using timestamps, right?) or, as liori said, use syslog.
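A minimal sketch of that per-file approach, assuming each scraper line starts with an ISO-8601 timestamp (scraper.sh and the site names are hypothetical placeholders):
./scraper.sh siteA > scraper-A.log &   # one log file per scraper process
./scraper.sh siteB > scraper-B.log &
wait                                   # let both scrapers finish
sort -m -k1,1 scraper-A.log scraper-B.log > combined.log   # merge on the leading timestamp field
sort -m only merges inputs that are already sorted, which each per-process log is, since it was written chronologically.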
In addition to the idea of using temporary files, you could also use some kind of aggregating process, although you would still need to make sure your writes are atomic.
Think of Apache2 with piped logging (with something like spread on the other end of the pipe if you're feeling ambitious). That's the approach it takes: multiple threads/processes share a single logging process.
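A rough sketch of that aggregator pattern with a named pipe, assuming each scraper writes whole lines shorter than PIPE_BUF so the writes stay atomic (scraper.sh is a hypothetical placeholder):
mkfifo /tmp/scrape.pipe
cat /tmp/scrape.pipe >> combined.log &            # the aggregator is the only process writing to the log
aggregator=$!
exec 3> /tmp/scrape.pipe                          # hold the pipe open so the aggregator doesn't see EOF early
./scraper.sh siteA > /tmp/scrape.pipe & pid_a=$!  # each scraper writes lines into the pipe
./scraper.sh siteB > /tmp/scrape.pipe & pid_b=$!
wait "$pid_a" "$pid_b"                            # wait for the scrapers only
exec 3>&-                                         # close our end; the aggregator then exits on EOF
wait "$aggregator"
rm /tmp/scrape.pipe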
As mentioned above, it's quite a hack, but it works pretty well =)
( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) | cat
The same thing with '>>':
( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) >> log
and with exec on the last one, you save one process:
( ping stackoverflow.com & ping stackexchange.com & exec ping fogcreek.com ) | cat
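Applied to the original scraping setup, the same trick would look roughly like this (my_scraper.sh and the site names are hypothetical stand-ins):
( ./my_scraper.sh siteA & ./my_scraper.sh siteB & exec ./my_scraper.sh siteC ) >> scraped.log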
No. It is not guaranteed that lines will remain intact. They can become intermingled.
Searching based on liori's answer, I found this in the POSIX specification of write():
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
So lines longer than {PIPE_BUF} bytes are not guaranteed to remain intact.
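You can check the actual limit on your system with getconf; POSIX only guarantees at least 512 bytes, and on Linux it is typically 4096. getconf requires a path argument for this variable; for a directory, the value applies to FIFOs created in it:
getconf PIPE_BUF /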
One possibly interesting thing you could do is use GNU parallel: http://www.gnu.org/s/parallel/ For example, if you were spidering the sites:
stackoverflow.com, stackexchange.com, fogcreek.com
you could do something like this:
(echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k your_spider_script
The output is buffered by parallel and, because of the -k option, returned to you in the order of the site list above. A real example (basically copied from the second parallel screencast):
~ $ (echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k ping -c 1 {}
PING stackoverflow.com (64.34.119.12): 56 data bytes
--- stackoverflow.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING stackexchange.com (64.34.119.12): 56 data bytes
--- stackexchange.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING fogcreek.com (64.34.80.170): 56 data bytes
64 bytes from 64.34.80.170: icmp_seq=0 ttl=250 time=23.961 ms
--- fogcreek.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 23.961/23.961/23.961/0.000 ms
Anyway, ymmv
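For the original scraping problem, the same pattern should also be safe when appending to a single file, because parallel collects each job's output and writes it out whole (your_spider_script is a hypothetical placeholder):
(echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k your_spider_script {} >> scraped.log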
Use temporary files and concatenate them together. It's the only safe way to do what you want to do, and the performance loss will (probably) be negligible. If performance is really a problem, make sure your /tmp directory is a RAM-based filesystem and put your temporary files there. That way the temporary files are stored in RAM instead of on a hard drive, so reading and writing them is near-instant.
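A minimal sketch of the temporary-file approach, assuming a RAM-backed mount such as /dev/shm and a hypothetical scraper.sh:
tmpdir=$(mktemp -d /dev/shm/scrape.XXXXXX)    # temporary directory on a RAM-based filesystem
./scraper.sh siteA > "$tmpdir/A.out" &        # one private file per scraper process
./scraper.sh siteB > "$tmpdir/B.out" &
wait                                          # let every scraper finish
cat "$tmpdir"/*.out >> scraped.log            # concatenate the intact per-process outputs
rm -r "$tmpdir"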