How can I calculate an md5 checksum of a directory?

。_饼干妹妹 提交于 2019-11-30 10:04:22

问题


I need to calculate a summary md5 checksum for all files of a particular type (*.py for example) placed under a directory and all sub-directories.

What is the best way to do that?

Edit: The proposed solutions are very nice, but this is not exactly what I need. I'm looking for a solution to get a single summary checksum which will uniquely identify the directory as a whole - including content of all its sub-directories.


回答1:


find /path/to/dir/ -type f -name "*.py" -exec md5sum {} + | awk '{print $1}' | sort | md5sum

The find command lists all the files that end in .py. The md5sum is computed for each .py file. awk is used to pick off the md5sums (ignoring the filenames, which may not be unique). The md5sums are sorted. The md5sum of this sorted list is then returned.

I've tested this by copying a test directory:

rsync -a ~/pybin/ ~/pybin2/

I renamed some of the files in ~/pybin2.

The find...md5sum command returns the same output for both directories.

2bcf49a4d19ef9abd284311108d626f1  -



回答2:


Create a tar archive file on the fly and pipe that to md5sum:

tar c dir | md5sum

This produces a single md5sum that should be unique to your file and sub-directory setup. No files are created on disk.




回答3:


ire_and_curses's suggestion of using tar c <dir> has some issues:

  • tar processes directory entries in the order which they are stored in the filesystem, and there is no way to change this order. This effectively can yield completely different results if you have the "same" directory on different places, and I know no way to fix this (tar cannot "sort" its input files in a particular order).
  • I usually care about whether groupid and ownerid numbers are the same, not necessarily whether the string representation of group/owner are the same. This is in line with what for example rsync -a --delete does: it synchronizes virtually everything (minus xattrs and acls), but it will sync owner and group based on their ID, not on string representation. So if you synced to a different system that doesn't necessarily have the same users/groups, you should add the --numeric-owner flag to tar
  • tar will include the filename of the directory you're checking itself, just something to be aware of.

As long as there is no fix for the first problem (or unless you're sure it does not affect you), I would not use this approach.

The find based solutions proposed above are also no good because they only include files, not directories, which becomes an issue if you the checksumming should keep in mind empty directories.

Finally, most suggested solutions don't sort consistently, because the collation might be different across systems.

This is the solution I came up with:

dir=<mydir>; (find "$dir" -type f -exec md5sum {} +; find "$dir" -type d) | LC_ALL=C sort | md5sum

Notes about this solution:

  • The LC_ALL=C is to ensure reliable sorting order across systems
  • This doesn't differentiate between a directory "named\nwithanewline" and two directories "named" and "withanewline", but the chance of that occuring seems very unlikely. One usually fixes this with a -print0 flag for find but since there's other stuff going on here, I can only see solutions that would make the command more complicated then it's worth.

PS: one of my systems uses a limited busybox find which does not support -exec nor -print0 flags, and also it appends '/' to denote directories, while findutils find doesn't seem to, so for this machine I need to run:

dir=<mydir>; (find "$dir" -type f | while read f; do md5sum "$f"; done; find "$dir" -type d | sed 's#/$##') | LC_ALL=C sort | md5sum

Luckily, I have no files/directories with newlines in their names, so this is not an issue on that system.




回答4:


If you only care about files and not empty directories, this works nicely:

find /path -type f | sort -u | xargs cat | md5sum



回答5:


For the sake of completeness, there's md5deep(1); it's not directly applicable due to *.py filter requirement but should do fine together with find(1).




回答6:


A solution which worked best for me:

find "$path" -type f -print0 | sort -z | xargs -r0 md5sum | md5sum

Reason why it worked best for me:

  1. handles file names containing spaces
  2. Ignores filesystem meta-data
  3. Detects if file has been renamed

Issues with other answers:

Filesystem meta-data is not ignored for:

tar c - "$path" | md5sum

Does not handle file names containing spaces nor detects if file has been renamed:

find /path -type f | sort -u | xargs cat | md5sum



回答7:


If you want one md5sum spanning the whole directory, I would do something like

cat *.py | md5sum 



回答8:


Checksum all files, including both content and their filenames

grep -ar -e . /your/dir | md5sum | cut -c-32

Same as above, but only including *.py files

grep -ar -e . --include="*.py" /your/dir | md5sum | cut -c-32

You can also follow symlinks if you want

grep -aR -e . /your/dir | md5sum | cut -c-32

Other options you could consider using with grep

-s, --no-messages         suppress error messages
-D, --devices=ACTION      how to handle devices, FIFOs and sockets;
-Z, --null                print 0 byte after FILE name
-U, --binary              do not strip CR characters at EOL (MSDOS/Windows)



回答9:


GNU find

find /path -type f -name "*.py" -exec md5sum "{}" +;



回答10:


Technically you only need to run ls -lR *.py | md5sum. Unless you are worried about someone modifying the files and touching them back to their original dates and never changing the files' sizes, the output from ls should tell you if the file has changed. My unix-foo is weak so you might need some more command line parameters to get the create time and modification time to print. ls will also tell you if permissions on the files have changed (and I'm sure there are switches to turn that off if you don't care about that).




回答11:


Using md5deep:

md5deep -r FOLDER | awk '{print $1}' | sort | md5sum




回答12:


I had the same problem so I came up with this script that just lists the md5sums of the files in the directory and if it finds a subdirectory it runs again from there, for this to happen the script has to be able to run through the current directory or from a subdirectory if said argument is passed in $1

#!/bin/bash

if [ -z "$1" ] ; then

# loop in current dir
ls | while read line; do
  ecriv=`pwd`"/"$line
if [ -f $ecriv ] ; then
    md5sum "$ecriv"
elif [ -d $ecriv ] ; then
    sh myScript "$line" # call this script again
fi

done


else # if a directory is specified in argument $1

ls "$1" | while read line; do
  ecriv=`pwd`"/$1/"$line

if [ -f $ecriv ] ; then
    md5sum "$ecriv"

elif [ -d $ecriv ] ; then
    sh myScript "$line"
fi

done


fi



回答13:


If you want really independance from the filesystem attributes and from the bit-level differences of some tar versions, you could use cpio:

cpio -i -e theDirname | md5sum



回答14:


There are two more solutions:

Create:

du -csxb /path | md5sum > file

ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum > /tmp/file

Check:

du -csxb /path | md5sum -c file

ls -alR -I dev -I run -I sys -I tmp -I proc /path | md5sum -c /tmp/file



回答15:


md5sum worked fine for me, but I had issues with sort and sorting file names. So instead I sorted by md5sum result. I also needed to exclude some files in order to create comparable results.

find . -type f -print0 \ | xargs -r0 md5sum \ | grep -v ".env" \ | grep -v "vendor/autoload.php" \ | grep -v "vendor/composer/" \ | sort -d \ | md5sum



来源:https://stackoverflow.com/questions/1657232/how-can-i-calculate-an-md5-checksum-of-a-directory

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!