Question
I am looking to use Java or Groovy to get the MD5 checksum of a complete directory.
I have to copy directories from source to target, checksum the source and the target, and then delete the source directories.
I found this script for files, but how do I do the same thing with directories?
import java.security.MessageDigest

def generateMD5(final file) {
    MessageDigest digest = MessageDigest.getInstance("MD5")
    // Feed the file to the digest in 8 KB chunks
    file.withInputStream() { is ->
        byte[] buffer = new byte[8192]
        int read = 0
        while ((read = is.read(buffer)) > 0) {
            digest.update(buffer, 0, read);
        }
    }
    byte[] md5sum = digest.digest()
    // Render as a 32-character, zero-padded hex string
    BigInteger bigInt = new BigInteger(1, md5sum)
    return bigInt.toString(16).padLeft(32, '0')
}
Is there a better approach?
Answer 1:
I had the same requirement and chose my 'directory hash' to be an MD5 hash of the concatenated streams of all (non-directory) files within the directory. As crozin mentioned in comments on a similar question, you can use SequenceInputStream to act as a stream concatenating a load of other streams. I'm using Apache Commons Codec for the MD5 algorithm.
Basically, you recurse through the directory tree, adding FileInputStream instances to a Vector for non-directory files. Vector then conveniently has the elements() method to provide the Enumeration that SequenceInputStream needs to loop through. To the MD5 algorithm, this just appears as one InputStream.
A gotcha is that you need the files presented in the same order every time for the hash to be the same with the same inputs. The listFiles() method in File doesn't guarantee an ordering, so I sort by filename.
I was doing this for SVN controlled files, and wanted to avoid hashing the hidden SVN files, so I implemented a flag to avoid hidden files.
The relevant basic code is as below. (Obviously it could be 'hardened'.)
import org.apache.commons.codec.digest.DigestUtils;
import java.io.*;
import java.util.*;

public String calcMD5HashForDir(File dirToHash, boolean includeHiddenFiles) {

    assert (dirToHash.isDirectory());
    Vector<FileInputStream> fileStreams = new Vector<FileInputStream>();

    System.out.println("Found files for hashing:");
    collectInputStreams(dirToHash, fileStreams, includeHiddenFiles);

    // Present all collected file streams to the hasher as one stream
    SequenceInputStream seqStream =
            new SequenceInputStream(fileStreams.elements());

    try {
        String md5Hash = DigestUtils.md5Hex(seqStream);
        seqStream.close();
        return md5Hash;
    }
    catch (IOException e) {
        throw new RuntimeException("Error reading files to hash in "
                                   + dirToHash.getAbsolutePath(), e);
    }

}

private void collectInputStreams(File dir,
                                 List<FileInputStream> foundStreams,
                                 boolean includeHiddenFiles) {

    File[] fileList = dir.listFiles();
    Arrays.sort(fileList,               // Need in reproducible order
                new Comparator<File>() {
                    public int compare(File f1, File f2) {
                        return f1.getName().compareTo(f2.getName());
                    }
                });

    for (File f : fileList) {
        if (!includeHiddenFiles && f.getName().startsWith(".")) {
            // Skip it
        }
        else if (f.isDirectory()) {
            collectInputStreams(f, foundStreams, includeHiddenFiles);
        }
        else {
            try {
                System.out.println("\t" + f.getAbsolutePath());
                foundStreams.add(new FileInputStream(f));
            }
            catch (FileNotFoundException e) {
                throw new AssertionError(e.getMessage()
                            + ": file should never not be found!");
            }
        }
    }

}
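One way the code above could be 'hardened': it opens a FileInputStream for every file before hashing starts, which can exhaust file handles on a large tree. The following is a minimal sketch, not part of the original answer, that reads one file at a time using only the JDK (java.security.MessageDigest instead of Commons Codec). The class name DirectoryMd5 is hypothetical, the hidden-file flag is left out for brevity, and the same sort-by-name ordering is kept so the files are fed to the digest in the same sequence.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper class; the name is illustrative only.
final class DirectoryMd5 {

    // Returns the hex MD5 of the concatenated contents of all files under dir.
    static String hash(File dir) throws IOException {
        if (!dir.isDirectory()) {
            throw new IllegalArgumentException("Not a directory: " + dir);
        }
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        update(dir, digest);
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Walks the tree in name order, feeding each file to the digest in turn.
    private static void update(File dir, MessageDigest digest) throws IOException {
        File[] children = dir.listFiles();
        if (children == null) {
            throw new IOException("Cannot list " + dir);
        }
        Arrays.sort(children, Comparator.comparing(File::getName));
        byte[] buffer = new byte[8192];
        for (File child : children) {
            if (child.isDirectory()) {
                update(child, digest);
            } else {
                // Only one file is open at any time
                try (InputStream in = Files.newInputStream(child.toPath())) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        digest.update(buffer, 0, read);
                    }
                }
            }
        }
    }
}
```

Calling DirectoryMd5.hash(new File("/some/dir")) on the source and on the copied target and comparing the two strings would cover the copy-then-verify case from the question. Note that isDirectory() follows symbolic links, so a link cycle would need extra handling.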
Answer 2:
I wrote a function to calculate the MD5 checksum of a directory.
First, I'm using FastMD5: http://www.twmacinta.com/myjava/fast_md5.php
Here is my code:
def MD5HashDirectory(String fileDir) {
    MD5 md5 = new MD5();
    new File(fileDir).eachFileRecurse{ file ->
        if (file.isFile()) {
            // Hash each file, then fold its hex digest into the directory digest
            String hashFile = MD5.asHex(MD5.getHash(new File(file.path)));
            md5.Update(hashFile, null);
        }
    }
    String hashFolder = md5.asHex();
    return hashFolder
}
Answer 3:
Based on Stuart Rossiter's answer, but with cleaner code and hidden files handled properly:
import org.apache.commons.codec.digest.DigestUtils;

import java.io.*;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Vector;

public class Hashing
{
    public static String hashDirectory(String directoryPath, boolean includeHiddenFiles) throws IOException
    {
        File directory = new File(directoryPath);

        if (!directory.isDirectory())
        {
            throw new IllegalArgumentException("Not a directory");
        }

        Vector<FileInputStream> fileStreams = new Vector<>();
        collectFiles(directory, fileStreams, includeHiddenFiles);

        // Hash all collected files as one concatenated stream
        try (SequenceInputStream sequenceInputStream = new SequenceInputStream(fileStreams.elements()))
        {
            return DigestUtils.md5Hex(sequenceInputStream);
        }
    }

    private static void collectFiles(File directory,
                                     List<FileInputStream> fileInputStreams,
                                     boolean includeHiddenFiles) throws IOException
    {
        File[] files = directory.listFiles();

        if (files != null)
        {
            // Sort by name so the hash is reproducible
            Arrays.sort(files, Comparator.comparing(File::getName));

            for (File file : files)
            {
                if (includeHiddenFiles || !Files.isHidden(file.toPath()))
                {
                    if (file.isDirectory())
                    {
                        collectFiles(file, fileInputStreams, includeHiddenFiles);
                    } else
                    {
                        fileInputStreams.add(new FileInputStream(file));
                    }
                }
            }
        }
    }
}
Answer 4:
HashCopy is a Java application. It can generate and verify MD5 and SHA on a single file or a directory recursively. I am not sure if it has an API. It can be downloaded from www.jdxsoftware.org.
Answer 5:
It's not clear what it means to take the md5sum of a directory. You might want the checksum of the file listing; you might want the checksum of the file listings and their contents. If you're already summing the file data themselves, I'd suggest you spec an unambiguous representation for a directory listing (watch out for evil characters in filenames), then compute and hash that each time. You also need to consider how you will handle special files (sockets, pipes, devices and symlinks in the unix world; NTFS has file streams and I believe something akin to symlinks as well).
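As one possible reading of "an unambiguous representation", here is a minimal sketch that hashes a sorted manifest of (relative path, content digest) entries. Everything in it is an assumption for illustration: the class name DirectoryManifestHash, the entry layout (4-byte path length, UTF-8 path, 16-byte MD5 of the contents), and the choice to skip anything that is not a regular file, including symlinks. The length prefix is there so that odd characters in filenames cannot make two different listings serialize to the same bytes.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Hypothetical helper class; names and entry layout are illustrative only.
final class DirectoryManifestHash {

    static String hash(Path root) throws IOException {
        // Sorted map: relative path with '/' separators -> file on disk
        TreeMap<String, Path> files = new TreeMap<>();
        try (Stream<Path> walk = Files.walk(root)) {
            for (Path p : (Iterable<Path>) walk::iterator) {
                // Regular files only; symlinks and special files are skipped
                if (Files.isRegularFile(p, LinkOption.NOFOLLOW_LINKS)) {
                    String rel = root.relativize(p).toString()
                                     .replace(File.separatorChar, '/');
                    files.put(rel, p);
                }
            }
        }
        MessageDigest total = md5();
        for (Map.Entry<String, Path> e : files.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            total.update(ByteBuffer.allocate(4).putInt(name.length).array());
            total.update(name);
            total.update(contentDigest(e.getValue()));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : total.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // MD5 of a single file's contents, read in 8 KB chunks.
    private static byte[] contentDigest(Path file) throws IOException {
        MessageDigest d = md5();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                d.update(buf, 0, n);
            }
        }
        return d.digest();
    }

    private static MessageDigest md5() {
        try {
            return MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Unlike a hash over concatenated contents alone, this changes when a file is renamed or when bytes move from the end of one file to the start of the next. Empty directories and file metadata are not covered; whether they should be is exactly the kind of decision the answer asks you to spell out.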
Source: https://stackoverflow.com/questions/9169137/java-hash-a-folder