Question
I have three folders: one master and two supplemental. I am writing a script that identifies duplicate files across all three by hashing their contents (SHA1 is the goal, although the code below currently calls `hashlib.md5`). For any duplicates found in both the master and the supplemental folders (or their subdirectories), I would like to delete the copies in the supplemental folders and keep the ones in the master folder. If duplicates exist only within the supplemental folders and not in the master folder, I would like to keep them and eventually merge them into the master.
I have written a script (below) that successfully gets rid of duplicate files in the supplemental folders. However, it gets rid of all duplicates, even when no copy of the file exists anywhere in the master folder tree. I am having trouble working out the logic to delete duplicate files in the supplemental folders ONLY if they already exist in the master folder. Any advice, suggestions, or tips would be much appreciated!
import hashlib
import os

def deleteDups(maindirectory, pnhpdirectory, dupdirectories):
    # map each content hash to the list of files that share it
    hashmap = {}
    for path, dirs, files in os.walk(maindirectory):
        for name in files:
            fullname = os.path.join(path, name)
            with open(fullname, 'rb') as f:
                d = f.read()
                h = hashlib.md5(d).hexdigest()
                filelist = hashmap.setdefault(h, [])
                filelist.append(fullname)
    # delete records in dictionary that have only 1 item (meaning no duplicate)
    for k, v in hashmap.items():
        if len(v) == 1:
            del hashmap[k]
    # make dictionary into flat list
    try:
        dups = reduce(lambda x, y: x+y, hashmap.values())
        paths = []  # list of all files in duplicate directories
        for directory in dupdirectories:
            for root, dirs, files in os.walk(directory):
                for name in files:
                    paths.append(os.path.join(root, name))
        # if file in directory is also in duplicates list, it will be deleted
        DeletedFileSize = 0.00
        for file in paths:
            if file in dups:
                FileSize = os.path.getsize(file)
                DeletedFileSize = DeletedFileSize + FileSize
                print "Deleting file: " + file
                os.remove(file)
            else:
                pass
        if DeletedFileSize == 0:
            print "No duplicate files found"
            print "Space saved: " + str(DeletedFileSize) + " gigabytes"
        else:
            # convert bytes to gigabytes
            DeletedFileSize = DeletedFileSize / 1073741824
            print "Space saved: " + str(DeletedFileSize) + " gigabytes"
    except TypeError:
        # reduce() raises TypeError when hashmap is empty
        print "No duplicate files found."
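For comparison, the rule described above (delete a supplemental file only when its content also exists somewhere under the master folder) could be sketched roughly as follows. This is a minimal illustration rather than a fix of the script above. It assumes Python 3, and it assumes the master and supplemental folders are separate directory trees. The names `file_sha1`, `walk_files`, `delete_if_in_master`, and the `dry_run` flag are hypothetical and do not appear in the original script.

```python
import hashlib
import os


def file_sha1(path, chunk_size=1 << 20):
    """Return the SHA1 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def walk_files(directory):
    """Yield the full path of every file under directory, including subdirectories."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            yield os.path.join(root, name)


def delete_if_in_master(master_dir, supplemental_dirs, dry_run=True):
    """Remove supplemental files whose content also exists under master_dir.

    Duplicates that exist only among the supplemental folders are left alone.
    Returns the paths that were (or, in a dry run, would be) deleted.
    Assumes master_dir and the supplemental folders do not overlap.
    """
    # Hash only the master tree; this set is the sole basis for deletion.
    master_hashes = {file_sha1(p) for p in walk_files(master_dir)}

    deleted = []
    for directory in supplemental_dirs:
        for path in walk_files(directory):
            if file_sha1(path) in master_hashes:
                deleted.append(path)
                if not dry_run:
                    os.remove(path)
    return deleted
```

With the default `dry_run=True` nothing is removed, so the returned list can be reviewed before calling again with `dry_run=False`. The difference from the script above is that membership is tested against hashes collected from the master tree only, not against a list of every file that has a duplicate anywhere.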
Source: https://stackoverflow.com/questions/38860276/deleting-duplicate-files-if-file-exists-in-certain-directories-python