I'm scraping data from the web, and I have several processes of my scraper running in parallel.
I want the output of each of these processes to end up in the same file. Can I just redirect them all to it with '>>', or will lines from different processes get mixed up with each other?
Definitely not. I had a log-management script where I assumed this worked, and it did work, until I moved it to a production server under load. Not a good day... Basically, you sometimes end up with completely mixed-up lines.
If I'm trying to capture from multiple sources, it is much simpler (and easier to debug) to keep a separate 'paper trail' file per source, and if I need an overall log file, to concatenate them based on timestamps (you are using timestamps, right?) or, as liori said, use syslog.
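A minimal sketch of that per-file approach, assuming each scraper line starts with an ISO-8601 timestamp (scraper.sh and the site names are hypothetical placeholders):
./scraper.sh siteA > scraper-A.log &   # one log file per scraper process
./scraper.sh siteB > scraper-B.log &
wait                                   # let both scrapers finish
sort -m -k1,1 scraper-A.log scraper-B.log > combined.log   # merge on the leading timestamp field
sort -m only merges inputs that are already sorted, which each per-process log is, since it was written chronologically.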
In addition to the idea of using temporary files, you could also use some kind of aggregating process, although you would still need to make sure your writes are atomic.
Think of Apache2 with piped logging (with something like spread on the other end of the pipe if you're feeling ambitious). That's the approach it takes: multiple threads/processes share a single logging process.
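A rough sketch of that aggregator pattern with a named pipe, assuming each scraper writes whole lines shorter than PIPE_BUF so the writes stay atomic (scraper.sh is a hypothetical placeholder):
mkfifo /tmp/scrape.pipe
cat /tmp/scrape.pipe >> combined.log &            # the aggregator is the only process writing to the log
aggregator=$!
exec 3> /tmp/scrape.pipe                          # hold the pipe open so the aggregator doesn't see EOF early
./scraper.sh siteA > /tmp/scrape.pipe & pid_a=$!  # each scraper writes lines into the pipe
./scraper.sh siteB > /tmp/scrape.pipe & pid_b=$!
wait "$pid_a" "$pid_b"                            # wait for the scrapers only
exec 3>&-                                         # close our end; the aggregator then exits on EOF
wait "$aggregator"
rm /tmp/scrape.pipe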
As mentioned above, it's quite a hack, but it works pretty well =)
( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) | cat
The same thing with '>>':
( ping stackoverflow.com & ping stackexchange.com & ping fogcreek.com ) >> log
and with exec on the last one, you save one process:
( ping stackoverflow.com & ping stackexchange.com & exec ping fogcreek.com ) | cat
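Applied to the original scraping setup, the same trick would look roughly like this (my_scraper.sh and the site names are hypothetical stand-ins):
( ./my_scraper.sh siteA & ./my_scraper.sh siteB & exec ./my_scraper.sh siteC ) >> scraped.log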
No. It is not guaranteed that lines will remain intact. They can become intermingled.
Searching based on liori's answer, I found this in the POSIX specification of write():
Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.
So lines longer than {PIPE_BUF} bytes are not guaranteed to remain intact.
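You can check the actual limit on your system with getconf; POSIX only guarantees at least 512 bytes, and on Linux it is typically 4096. getconf requires a path argument for this variable; for a directory, the value applies to FIFOs created in it:
getconf PIPE_BUF /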
One possibly interesting thing you could do is use GNU parallel: http://www.gnu.org/s/parallel/ For example, if you were spidering the sites:
stackoverflow.com, stackexchange.com, fogcreek.com
you could do something like this:
(echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k your_spider_script
The output is buffered by parallel and, because of the -k option, returned to you in the order of the site list above. A real example (basically copied from the second parallel screencast):
~ $ (echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k ping -c 1 {}
PING stackoverflow.com (64.34.119.12): 56 data bytes
--- stackoverflow.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING stackexchange.com (64.34.119.12): 56 data bytes
--- stackexchange.com ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss
PING fogcreek.com (64.34.80.170): 56 data bytes
64 bytes from 64.34.80.170: icmp_seq=0 ttl=250 time=23.961 ms
--- fogcreek.com ping statistics ---
1 packets transmitted, 1 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 23.961/23.961/23.961/0.000 ms
Anyway, ymmv
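For the original scraping problem, the same pattern should also be safe when appending to a single file, because parallel collects each job's output and writes it out whole (your_spider_script is a hypothetical placeholder):
(echo stackoverflow.com; echo stackexchange.com; echo fogcreek.com) | parallel -k your_spider_script {} >> scraped.log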
Use temporary files and concatenate them together. It's the only safe way to do what you want to do, and the performance loss will (probably) be negligible. If performance is really a problem, make sure your /tmp directory is a RAM-based filesystem and put your temporary files there. That way the temporary files are stored in RAM instead of on a hard drive, so reading and writing them is near-instant.
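A minimal sketch of the temporary-file approach, assuming a RAM-backed mount such as /dev/shm and a hypothetical scraper.sh:
tmpdir=$(mktemp -d /dev/shm/scrape.XXXXXX)    # temporary directory on a RAM-based filesystem
./scraper.sh siteA > "$tmpdir/A.out" &        # one private file per scraper process
./scraper.sh siteB > "$tmpdir/B.out" &
wait                                          # let every scraper finish
cat "$tmpdir"/*.out >> scraped.log            # concatenate the intact per-process outputs
rm -r "$tmpdir"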