I have observed a few times now that A | B | C may not lead to immediate output, even though A is constantly producing output. I have no idea how this is even possible. From my understanding, all three processes ought to be working at the same time, putting their output into the next pipe (or stdout) and taking from the previous pipe when they are finished with one step.
Here's an example where I am currently experiencing that:
tcpflow -ec -i any port 8340 | tee second.flow | grep -i "\(</Manufacturer>\)\|\(</SerialNumber>\)" | awk -F'[<>]' '{print $3}'
What is supposed to happen:
I watch one port for TCP packets. Whatever arrives should be in a certain XML format, and I want to grep the Manufacturer and the SerialNumber out of those packets. I would also like to get the full, unmodified output in a text file "second.flow" for later reference.
What happens:
Everything works as desired, except that instead of getting output every 10 seconds (I'm sure the packets arrive every ten seconds!) I have to wait a long time and then a lot is printed at once. It's as if one of the tools gobbles up everything into a buffer and only prints it once the buffer is full. I don't want that; I want to get each line as fast as possible.
If I replace tcpflow ... with cat second.flow, it works immediately. Can someone describe what's going on? And, in case it's obvious: would there be another way to achieve the same result?
Every layer in a series of pipes can involve buffering. By default, tools that don't specify buffering behavior for stdout will use line buffering when outputting to a terminal, and block buffering when outputting anywhere else (including piping to another program or a file). In a chained pipe, all but the last stage see their output as not going to the terminal, and will block buffer.
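A minimal way to watch this happen (a sketch, assuming GNU grep; the sleep loop is just a stand-in for any slow producer):

# grep's stdout is a pipe here, so it block buffers: long silence, then a burst.
while sleep 1; do echo tick; done | grep tick | cat

# Forcing line buffering makes each "tick" appear once per second instead.
while sleep 1; do echo tick; done | grep --line-buffered tick | cat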
So in your case, tcpflow might be producing output constantly, and if it is, tee should be producing data at almost the same rate. But grep is going to reduce that flow to a trickle, and won't produce output until that trickle exceeds the size of its output buffer. It has already performed the filtering and called fwrite or puts or printf, but the data is waiting for enough bytes to build up behind it before being sent along to awk, to reduce the number of (expensive) system calls.
cat second.flow produces output immediately because as soon as cat finishes producing output, it exits, flushing and closing its stdout in the process. That cascades: when each stage finds its stdin at EOF, it exits too, flushing and closing its own stdout. tcpflow isn't exiting, so that cascade of EOFs and flushes never happens.
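You can watch that flush-on-exit cascade with a slow but finite producer (a sketch; the subshell stands in for a command that eventually exits):

# "one" sits in grep's block buffer for the whole sleep; when the subshell
# exits, grep hits EOF, flushes, and both lines appear at once, about 5 s in.
(echo one; sleep 5; echo two) | grep . | cat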
For some programs, in the general case, you can change the buffering behavior by using stdbuf (or unbuffer, though that can't do line buffering to balance efficiency, and has issues with piped input). If the program does its own internal buffering, this still might not work, but it's worth a shot.
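The general shape looks like this (a sketch; producer and consumer are placeholders, not real commands):

stdbuf -oL producer | consumer   # line-buffered stdout (GNU coreutils)
stdbuf -o0 producer | consumer   # fully unbuffered stdout
unbuffer producer | consumer     # from expect: runs producer on a pseudo-terminal,
                                 # so its stdio believes it's writing to a terminal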
In your specific case, though, it's likely grep that's causing the delay (by producing only a trickle of output that gets stuck in its buffer, where tcpflow and tee are producing a torrent, and awk is connected to the terminal and therefore line buffered by default), so you can just adjust your command line to:
tcpflow -ec -i any port 8340 | tee second.flow | grep -i --line-buffered "\(</Manufacturer>\)\|\(</SerialNumber>\)" | awk -F'[<>]' '{print $3}'
At least for Linux's grep (I'm not sure whether the switch is standard), that makes grep explicitly switch its own output buffering to line buffering, which should remove the delay. If tcpflow itself is not producing enough output to flush regularly (you implied it is, but you could be wrong), you'd use stdbuf on it (but not on tee, which, as the stdbuf man page notes, adjusts its own buffering, so stdbuf has no effect on it) to make it line buffered:
stdbuf -oL tcpflow -ec -i any port 8340 | tee second.flow | grep -i --line-buffered "\(</Manufacturer>\)\|\(</SerialNumber>\)" | awk -F'[<>]' '{print $3}'
Update from comments: It looks like some flavors of awk block buffer prints to stdout even when connected to a terminal. For mawk (the default on many Debian-based distros), you can disable that non-portably by passing the -Winteractive switch at invocation. Alternatively, you can call system("") after each print, which portably forces output flushing on all implementations of awk. Sadly, the obvious fflush() is not portable to older implementations of awk, but if you only care about modern awk, just use fflush() to be obvious and mostly portable.
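Applied to the pipeline above, any of these tail ends should keep awk from holding lines back (the leading ... stands for the earlier tcpflow | tee | grep stages):

... | awk -F'[<>]' '{print $3; system("")}'    # portable: forces a flush after each print
... | awk -F'[<>]' '{print $3; fflush()}'      # modern awk: fflush() flushes stdout
... | mawk -Winteractive -F'[<>]' '{print $3}' # mawk only: line-buffered interactive mode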
Reduce Buffering
Each application in the pipeline can do its own buffering. You may want to see if you can reduce buffering in tcpflow, as your other commands are line-oriented and unlikely to be the source of your buffering issue. I didn't see any specific options for buffer control in tcpflow, though the -b flag (max_bytes) may help in circumstances where the text you want to work with is near the front of the flow.
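For example (a hedged sketch; whether this helps depends entirely on where in each flow your XML sits):

tcpflow -ec -b 4096 -i any port 8340 # keep at most 4096 bytes of each flow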
You can also try modifying the buffering of tcpflow using stdbuf from GNU coreutils. This may help to reduce latency in your pipeline, but the man page provides the following caveats:
NOTE: If COMMAND adjusts the buffering of its standard streams ('tee' does for example) then that will override corresponding changes by 'stdbuf'. Also some filters (like 'dd' and 'cat' etc.) don't use streams for I/O, and are thus unaffected by 'stdbuf' settings.
As an example, the following may reduce output buffering of tcpflow:
stdbuf --output=0 tcpflow -ec -i any port 8340 # unbuffered output
stdbuf --output=L tcpflow -ec -i any port 8340 # line-buffered output
unless one of the caveats above applies. Your mileage may vary.
Source: https://stackoverflow.com/questions/33529573/piping-sometimes-does-not-lead-to-immediate-output