I just learned that cpio has three modes: copy-out, copy-in and pass-through.
I was wondering what the advantages and disadvantages of cpio are in copy-out and copy-in modes.
Why is cpio better than tar? A number of reasons.
When scripting, cpio gives you much better control over which files are and are not copied, since you must explicitly list the files you want copied. For example, which of the following is easier to read and understand?
find . -type f -name '*.sh' -print | cpio -o | gzip >sh.cpio.gz
or on Solaris:
find . -type f -name '*.sh' -print >/tmp/includeme
tar -cf - -I /tmp/includeme | gzip >sh.tar.gz
or with gnutar:
find . -type f -name '*.sh' -print >/tmp/includeme
tar -cf - --files-from=/tmp/includeme | gzip >sh.tar.gz
A couple of specific notes here: for large lists of files, you can't put the find in backquotes; the resulting command line would exceed the maximum argument-list length, so you must use an intermediate file. And separate find and tar commands are inherently slower, since the actions are done serially.
Consider this more complex case, where you want a tree completely packaged up, but with some files in one archive and the remaining files in another.
find . -depth -print >/tmp/files
egrep '\.sh$' /tmp/files | cpio -o | gzip >with.cpio.gz
egrep -v '\.sh$' /tmp/files | cpio -o | gzip >without.cpio.gz
or under Solaris:
find . -depth -print >/tmp/files
egrep '\.sh$' /tmp/files >/tmp/with
egrep -v '\.sh$' /tmp/files >/tmp/without
tar -cf - -I /tmp/with | gzip >with.tar.gz
tar -cf - -I /tmp/without | gzip >without.tar.gz
or with gnutar:
find . -depth -print >/tmp/files
egrep '\.sh$' /tmp/files >/tmp/with
tar -cf - --files-from=/tmp/with | gzip >with.tar.gz
tar -cf - . -X /tmp/with | gzip >without.tar.gz
Again, some notes: separate find and tar commands are inherently slower. Creating more intermediate files leaves more clutter. gnutar feels a little cleaner, but its command-line options are incompatible with Solaris tar's!
If you need to copy a lot of files from one machine to another in a hurry across a busy network, you can run multiple cpio's in parallel. For example:
find . -depth -print >/tmp/files
split /tmp/files /tmp/files.
for F in /tmp/files.?? ; do
  cat "$F" | cpio -o | ssh destination "cd /target && cpio -idum" &
done
wait
Note that it helps if you can split the input into evenly sized pieces. I created a utility called 'npipe' to do this. npipe reads lines from stdin, creates N output pipes, and feeds lines to them as each line is consumed. That way, if the first entry is a large file that takes 10 minutes to transfer and the rest are small files that take 2 minutes, you don't get stalled waiting for the large file with another dozen small files queued up behind it. You end up splitting by demand, not strictly by number of lines or bytes in the list of files. Similar functionality could be accomplished with GNU xargs' parallel forking capability, except that xargs puts arguments on the command line instead of streaming them to stdin.
find . -depth -print >/tmp/files
npipe -4 /tmp/files 'cpio -o | ssh destination "cd /target && cpio -idum"'
How is this faster? Why not use NFS? Why not use rsync? NFS is inherently very slow, but more importantly, any single tool is inherently single-threaded. rsync reads the source tree and writes the destination tree one file at a time. If you have a multi-processor machine (at the time I was using 16 CPUs per machine), parallel writing becomes very important. I sped the copy of an 8GB tree up to the point where it took 30 minutes; that's 4.6MB/sec. That sounds slow, since a 100Mbit network can easily do 5-10MB/sec, but it's the inode-creation time that makes it slow; there were easily 500,000 files in this tree. So if inode creation is the bottleneck, that's the operation I needed to parallelize. By comparison, copying the files in a single-threaded manner would have taken 4 hours. That's 8x faster!
A secondary reason this was faster is that parallel TCP pipes are less vulnerable to a lost packet here and there. If one pipe gets stalled by a lost packet, the others will generally not be affected. I'm not really sure how much of a difference this made, but on finely multi-threaded kernels it can again be more efficient, since the workload is spread across all those otherwise-idle CPUs.
In my experience, cpio does an overall better job than tar, and its arguments are more portable (they don't change between versions of cpio!). It may not be found on some systems (it's not installed by default on Red Hat), but then again Solaris doesn't come with gzip by default either.