Determine whether any files have been added, removed, or modified in a directory

Submitted by 痴心易碎 on 2019-11-30 04:01:17

Question


I'm trying to write a Python script that will get the md5sum of all files in a directory (in Linux), which I believe I have done in the code below.

I want to be able to run this to make sure no files within the directory have changed, and no files have been added or deleted.

The problem is that if I make a change to a file in the directory and then change it back, I get a different result from running the function below, even though the file's contents are the same as before.

Can anyone explain this, and let me know if you can think of a workaround?

import hashlib
import os
import tarfile

def get_dir_md5(dir_path):
    """Build a tar file of the directory and return its md5 sum"""
    temp_tar_path = 'tests.tar'
    with tarfile.open(temp_tar_path, mode='w') as t:
        t.add(dir_path)

    m = hashlib.md5()
    with open(temp_tar_path, 'rb') as f:
        m.update(f.read())
    ret_str = m.hexdigest()

    # delete tar file
    os.remove(temp_tar_path)
    return ret_str

Edit: As these fine folks have answered, it looks like tar includes header information like date modified. Would using zip or another format work any differently?

Any other ideas for workarounds?


Answer 1:


As the other answers mentioned, two tar files can differ even if the contents are the same, due either to tar metadata changes or to file order changes. You should run the checksum on the file data directly, sorting the directory lists to ensure they are always in the same order. If you want to include some metadata in the checksum, include it manually.

Untested example using os.walk:

import hashlib
import os
import os.path
import struct  # only needed if the mtime or size options below are enabled

def get_dir_md5(dir_root):
    """Return the md5 sum of the contents of all files under dir_root"""

    hash = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(dir_root, topdown=True):

        # Sort in place so the walk visits everything in the same order every run
        dirnames.sort(key=os.path.normcase)
        filenames.sort(key=os.path.normcase)

        for filename in filenames:
            filepath = os.path.join(dirpath, filename)

            # If some metadata is required, add it to the checksum

            # 1) filename (good idea)
            # relpath = os.path.normcase(os.path.relpath(filepath, dir_root))
            # hash.update(os.fsencode(relpath))

            # 2) mtime (possibly a bad idea)
            # st = os.stat(filepath)
            # hash.update(struct.pack('d', st.st_mtime))

            # 3) size (good idea perhaps)
            # hash.update(struct.pack('q', st.st_size))

            with open(filepath, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    hash.update(chunk)

    return hash.hexdigest()



Answer 2:


TAR file headers include a field for the modified time of the file; the act of changing a file, even if that change is later changed back, will mean the TAR file headers will be different, leading to different hashes.
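To see this for yourself, here is a minimal sketch that builds a tar of a directory and reads back the modification time stored in each member's header. The helper name `tar_mtimes`, the scratch file name, and the fixed `arcname` are made up for illustration; only the standard `tarfile` module is assumed.

```python
import os
import tarfile
import tempfile

def tar_mtimes(dir_path):
    """Return {member name: mtime} as recorded in the tar headers for dir_path."""
    with tempfile.TemporaryDirectory() as scratch:
        tar_path = os.path.join(scratch, "snapshot.tar")
        # Write an uncompressed tar of the directory under a fixed archive name.
        with tarfile.open(tar_path, "w") as tar:
            tar.add(dir_path, arcname="root")
        # Read the headers back; each TarInfo carries the file's mtime.
        with tarfile.open(tar_path, "r") as tar:
            return {m.name: m.mtime for m in tar.getmembers() if m.isfile()}
```

Calling it before and after rewriting a file with identical contents should show a different mtime for that member, which is enough to change the md5 of the whole archive.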




Answer 3:


You do not need to make the TAR file to do what you propose.

Here is your workaround algorithm:

  1. Walk the directory tree;
  2. Take the md5 signature of each file;
  3. Sort the signatures;
  4. Take the md5 signature of the text string of all the signatures of the individual files.

The single resulting signature will be what you are looking for.
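A rough Python sketch of those four steps might look like the following. The function names `file_md5` and `dir_signature` are my own, not from any library. Note that because only file contents are hashed, renaming or moving a file without changing it would not alter the result; hash the relative path as well if that matters to you.

```python
import hashlib
import os

def file_md5(path, chunk_size=65536):
    """Return the md5 hex digest of one file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def dir_signature(dir_root):
    """Hash every file under dir_root, sort the digests, and hash the sorted list."""
    signatures = []
    # 1. Walk the directory tree; 2. take the md5 of each file.
    for dirpath, _dirnames, filenames in os.walk(dir_root):
        for filename in filenames:
            signatures.append(file_md5(os.path.join(dirpath, filename)))
    # 3. Sort the signatures so the walk order does not matter.
    signatures.sort()
    # 4. Take the md5 of the text string of all the signatures.
    return hashlib.md5("\n".join(signatures).encode("ascii")).hexdigest()
```

Store the value returned by `dir_signature` once, then compare it against a fresh call later; any added, removed, or modified file should change it.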

Heck, you don't even need Python. You can do this:

find /path/to/dir/ -type f -name '*.py' -exec md5sum {} + | awk '{print $1}' \
| sort | md5sum



Answer 4:


tar files contain metadata beyond the actual file contents, such as modification times, permissions, and ownership (and, in some tar formats, access times). Even if the file contents don't change, the tar file can therefore be different.
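If you would rather keep the tar approach, one possible workaround is to overwrite the volatile header fields while the archive is being built. The sketch below is untested beyond simple cases and rests on some assumptions: Python 3.7 or later (where `TarFile.add` recurses in sorted order), an in-memory archive, and a fixed `arcname`. The names `normalized_tar_md5` and `_normalize` are made up for illustration.

```python
import hashlib
import io
import tarfile

def _normalize(info):
    """Blank out header fields that change without the contents changing."""
    info.mtime = 0
    info.uid = info.gid = 0
    info.uname = info.gname = ""
    return info

def normalized_tar_md5(dir_path):
    """Return the md5 of an in-memory tar of dir_path with normalized headers."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # A fixed arcname keeps the directory's own location out of the hash;
        # the filter is applied to every member, including subdirectories.
        tar.add(dir_path, arcname="root", filter=_normalize)
    return hashlib.md5(buf.getvalue()).hexdigest()
```

File names, sizes, and permission bits still go into the headers, so renames and mode changes are detected, while changing a file and then changing it back should give the same sum.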



Source: https://stackoverflow.com/questions/7325072/determine-whether-any-files-have-been-added-removed-or-modified-in-a-directory
