Question
I have to compare the checksums of all files in the /primary and /secondary folders on machineA with the files in the folder /bat/snap/ on the remote server machineB. The remote server has lots of other files in addition to the ones we have on machineA.
- If there is any checksum mismatch, I want to report all the affected files on machineA with their full paths and exit with a non-zero status code.
- If everything matches, exit with zero.
I wrote one command (not sure whether there is a better way to write it) that I am running on machineA, but it's very slow. Is there any way to make it faster?
(cd /primary && find . -type f -exec md5sum {} +; cd /secondary && find . -type f -exec md5sum {} +) | ssh machineB '(cd /bat/snap/ && md5sum -c)'
It also prints file names like ./abc_monthly_1536_proc_7.data: OK. Is there any way to make it print the full path of that file on machineA?
Running ssh to the remote host for every file definitely isn't very efficient. parallel could speed it up by processing more files concurrently, but the more efficient way is likely to tweak the command so that it connects to machineB once and gets all the md5sums in one shot. Is this possible?
Answer 1:
If your primary goal is not to compute the checksums but to list the differences, a faster (and easier) way may be to run rsync with the --dry-run option. Any files it lists differ, for example:
MBP:~ jhartman$ rsync -avr --dry-run rsync-test 192.168.1.100:/tmp/; echo $?
building file list ... done
rsync-test/file1.txt
sent 172 bytes received 26 bytes 396.00 bytes/sec
total size is 90 speedup is 0.45
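Of course, because of --dry-run, no files are changed on the target. Note that by default rsync decides whether a file differs from its size and modification time; add -c (--checksum) to make it compare checksums instead.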
I hope it will help, Jarek
Answer 2:
If the files are directly in the directories /primary and /secondary rather than in subdirectories beneath them, drop the find. You may also wish to parallelize the md5 calculation. That would make it:
#!/bin/bash
cd /primary
md5sum * > /tmp/file-p &
cd /secondary
md5sum * > /tmp/file-s &
wait
cat /tmp/file-p /tmp/file-s | ssh machineB '(cd /bat/snap/ && md5sum -c)'
With a relatively small set of files:
$ time find . -exec md5sum {} \;
7e74a9f865a91c5b56b5cab9709f1f36 ./file
631f01c98ff2016971fb1ea22be3c2cf ./hosts
d41d8cd98f00b204e9800998ecf8427e ./fortune8547
49d05af711e2d473f12375d720fb0a92 ./vboxdrv-Module.symvers
bf4b1d740f7151dea0f42f5e9e2b0c34 ./tmpavG1pB
a9b0d3af1b80a46b92dfe1ce56b2e85c ./in.clean.4524
real 0m0.046s
user 0m0.035s
sys 0m0.006s
$ time md5sum *
7e74a9f865a91c5b56b5cab9709f1f36 file
d41d8cd98f00b204e9800998ecf8427e fortune8547
631f01c98ff2016971fb1ea22be3c2cf hosts
a9b0d3af1b80a46b92dfe1ce56b2e85c in.clean.4524
bf4b1d740f7151dea0f42f5e9e2b0c34 tmpavG1pB
49d05af711e2d473f12375d720fb0a92 vboxdrv-Module.symvers
real 0m0.005s
user 0m0.003s
sys 0m0.002s
(just to prove that find is not always the quickest).
Answer 3:
With md5sum you can check files against an input file of MD5 sums.
man md5sum: the following two options are useful:
-c, --check: read MD5 sums from the FILEs and check them
--quiet: don't print OK for each successfully verified file
So all we need to do is build such a file and pass it on. The easiest way is the following (from machineA):
$ cd /primary; md5sum * | ssh machineB '(cd /bat/snap; md5sum -c - --quiet 2>/dev/null)'
$ cd /secondary; md5sum * | ssh machineB '(cd /bat/snap; md5sum -c - --quiet 2>/dev/null)'
This will report things as :
file1: FAILED
file2: FAILED open or read
This gives you all the failed files per directory. You can do any post-processing later with any flavour of awk.
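The post-processing step can also be done in plain bash. The following is a minimal sketch, not part of the original answer: report_failures is a hypothetical helper name, and it assumes the remote side prints lines in the "name: FAILED" form shown above. It prepends the local directory to each failed name and returns non-zero if anything failed, which covers the full-path and exit-status requirements from the question.

```shell
#!/bin/bash
# Hypothetical helper (name is an assumption, not from the answer).
# Reads `md5sum -c --quiet` output on stdin and prints every failed
# file with the local directory $1 prepended.
# Returns 1 if at least one failure was seen, 0 otherwise.
report_failures() {
    local dir=$1 status=0 line name
    while IFS= read -r line; do
        case $line in
            *": FAILED"*)
                # Strip ": FAILED" and anything after it, keeping the name.
                name=${line%: FAILED*}
                printf '%s/%s\n' "$dir" "$name"
                status=1
                ;;
        esac
    done
    return "$status"
}

# Usage sketch (run on machineA; the ssh command is the one from the answer):
#   rc=0
#   (cd /primary && md5sum * |
#       ssh machineB '(cd /bat/snap; md5sum -c - --quiet 2>/dev/null)') |
#       report_failures /primary || rc=1
#   # ...repeat for /secondary, then:
#   exit "$rc"
```

Because the helper is the last command in the pipeline, its return value becomes the pipeline's exit status, so no extra option such as pipefail is needed for this particular check.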
Answer 4:
You can try to parallelize the process mentioned in the other answer: change the + to a \; and have bash run each md5sum in the background with &.
find "$PWD" -type f -exec bash -c 'md5sum "$1" &' _ {} \;
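Backgrounding one md5sum per file starts an unbounded number of processes. As an alternative sketch (a swapped-in technique, not what the answer proposes), xargs -P can cap the number of concurrent md5sum processes. parallel_md5 is a hypothetical name, and the defaults of 4 jobs and 32 files per batch are arbitrary assumptions; -P and -r are GNU/BSD xargs extensions.

```shell
#!/bin/bash
# Hypothetical helper (name and defaults are assumptions).
# Hashes every regular file under directory $1, running at most
# $2 md5sum processes at the same time (default 4).
parallel_md5() {
    local dir=$1 jobs=${2:-4}
    # -print0 / -0 keep file names containing spaces intact.
    # -n 32 gives each md5sum a batch of at most 32 files.
    # -r skips running md5sum when find produces no files.
    (cd "$dir" && find . -type f -print0 |
        xargs -0 -r -n 32 -P "$jobs" md5sum)
}

# Usage sketch: parallel_md5 /primary 8 > /tmp/file-p
```

The order of the output lines is not deterministic, so sort the result before comparing two listings. The names are printed relative to the directory, so the listing can still be fed to md5sum -c on machineB.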
Source: https://stackoverflow.com/questions/50070866/compare-checksum-of-files-between-two-servers-and-report-mismatch