Question
How do I find duplicate files by comparing them by size (ie: not hashing) in bash.
Testbed files:
-rw-r--r-- 1 usern users 68239 May 3 12:29 The W.pdf
-rw-r--r-- 1 usern users 68239 May 3 12:29 W.pdf
-rw-r--r-- 1 usern users 8 May 3 13:43 X.pdf
Yes, files can have spaces (Boo!).
I want to check files in the same directory, and move the ones which match something else into a 'these are probably duplicates' folder.
My probable use-case is going to have humans randomly mis-naming a smaller set of files (ie: not generating files of arbitrary length). It is fairly unlikely that two files will be the same size and yet be different files. Sure, as a backup I could hash and check two files of identical size. But mostly, it will be people taking a file and misnaming it / re-adding it to a pile it is already in.
So, preferably a solution with widely installed tools (POSIX?). And I'm not supposed to parse the output of ls, so I need another way to get the actual size (and not a du approximation).
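(For what it's worth: one workaround that stays within POSIX tools is wc -c, which prints an exact byte count rather than a du-style block estimate. A minimal sketch, assuming regular files and using one of the testbed files above:)
# Exact size in bytes via POSIX wc -c; the arithmetic expansion strips the
# padding some wc implementations put around the number.
size=$(( $(wc -c < "The W.pdf") ))
printf '%s\n' "$size"    # -> 68239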
"Vote to close!"
Hold up cowboy.
I bet you're going to suggest this (cool, you can google search):
https://unix.stackexchange.com/questions/71176/find-duplicate-files
No fdupes (nor jdupes, nor...), nor finddup, nor rmlint, nor fslint - I can't guarantee those on other systems (much less mine), and I don't want to be stuck as customer support dealing with installing them on random systems from now to eternity, nor even in getting emails about that sh...stuff and having to tell them to RTFM and figure it out. Plus, in reality, I should write my script to test the functionality of what is installed, but that's beyond the scope.
https://unix.stackexchange.com/questions/192701/how-to-remove-duplicate-files-using-bash
All these solutions want to start by hashing. Some cool ideas in some of these: hash just a chunk of both files, starting somewhere past the header, then only do a full compare if those turn up matching. Good idea for double-checking work, but I would prefer to only do that on the very, very few that actually are duplicates. Looking over the first several thousand of these by hand, not one duplicate has been even close to a different file.
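(A rough sketch of that chunk idea with plain POSIX tools, dd to grab a slice past a presumed header and cksum for the checksum; the offset, the chunk size, and the function name are all made up for illustration:)
# Checksum 4 KiB starting 1 KiB into the file; only cmp the full files
# when the chunk checksums collide.
chunk_sum() {
    dd if="$1" bs=1024 skip=1 count=4 2>/dev/null | cksum
}
# e.g.: [ "$(chunk_sum a.pdf)" = "$(chunk_sum b.pdf)" ] && cmp -s a.pdf b.pdf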
https://unix.stackexchange.com/questions/277697/whats-the-quickest-way-to-find-duplicated-files
Proposed:
$find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
Breaks for me:
find: unknown option -- n
usage: find [-dHhLXx] [-f path] path ... [expression]
uniq: unknown option -- w
usage: uniq [-ci] [-d | -u] [-f fields] [-s chars] [input_file [output_file]]
find: unknown option -- t
usage: find [-dHhLXx] [-f path] path ... [expression]
xargs: md5sum: No such file or directory
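(Those usage strings look like a BSD userland, where find has no -printf, uniq has no -w, and md5sum isn't installed by default. A hedged, size-only substitute for the first stage of that pipeline, assuming no newlines in file names, could be something like:)
# Emit "SIZE NAME" lines without find -printf; wc -c gives the exact byte
# count, and duplicated sizes can then be pulled out with cut/sort/uniq -d.
for f in ./*; do
    [ -f "$f" ] || continue
    printf '%s %s\n' "$(( $(wc -c < "$f") ))" "$f"
done | sort -n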
https://unix.stackexchange.com/questions/170693/compare-directory-trees-regarding-file-name-and-size-and-date
Haven't been able to figure out how rsync -nrvc --delete might work in the same directory, but there might be a solution in there.
Well, how about cmp? Yeah, that looks pretty good, actually!
cmp -z file1 file2
Bummer, my version of cmp does not include the -z size option.
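(For reference, a minimal sketch of emulating -z with POSIX pieces: compare byte counts via wc -c first, and only fall back to cmp -s on a size match. The function name is made up.)
# Poor man's "cmp -z": byte-by-byte comparison only when the sizes agree.
same_size_and_content() {
    [ "$(( $(wc -c < "$1") ))" -eq "$(( $(wc -c < "$2") ))" ] && cmp -s "$1" "$2"
}
# e.g.: same_size_and_content "The W.pdf" "W.pdf" && echo "probably duplicates"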
However, I tried implementing it just for grins - and when it failed, looking at it I realized that I also need help constructing my loop logic. Removing things from my loops in the midst of processing them is probably a recipe for breakage, duh.
if [ ! -d ../Dupes/ ]; then
    mkdir ../Dupes/ || exit 1    # Cuz no set -e, and trap not working
fi
for i in ./*
do
    for j in ./*
    do
        if [[ "$i" != "$j" ]]; then          # Yes, it will be identical to itself
            if [[ $(cmp -s "$i" "$j") ]]; then
                echo "null"                  # Cuz I can't use negative of the comparison?
            else
                mv -i "$i" ../Dupes/
            fi
        fi
    done
done
https://unix.stackexchange.com/questions/367749/how-to-find-and-delete-duplicate-files-within-the-same-directory
Might have something I could use, but I'm not following what's going on in there.
https://superuser.com/questions/259148/bash-find-duplicate-files-mac-linux-compatible
If it were something that returns size, instead of md5, maybe one of the answers in here?
https://unix.stackexchange.com/questions/570305/what-is-the-most-efficient-way-to-find-duplicate-files
Didn't really get answered.
TIL: Sending errors from . scriptname will close my terminal instantly. Thanks, Google!
TIL: Sending errors from scripts executed via $PATH will close the terminal if shopt -s extdebug + trap checkcommand DEBUG are set in profile to try and catch rm -r * - but at least it will respect my alias for exit
TIL: Backticks deprecated, use $(things) - Ugh, so much re-writing to do :P
TIL: How to catch non-ascii characters in filenames, without using basename
TIL: "${file##*/}"
TIL: file - yes, X.pdf is not a PDF.
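(Tiny illustration of the "${file##*/}" TIL, with a made-up path: it strips the longest prefix ending in a slash, i.e. a pure-shell basename, and copes with spaces and non-ASCII just fine.)
file='./incoming pile/Thé W.pdf'
printf '%s\n' "${file##*/}"    # -> Thé W.pdf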
Answer 1:
On the matter of POSIX
I'm afraid you cannot get the actual file size (not the number of blocks allocated by the file) in a plain POSIX shell without using ls. All the solutions like du --apparent-size, find -printf %s, and stat are not POSIX.
However, as long as your filenames don't contain linebreaks (spaces are ok) you could create safe solutions relying on ls. Correctly handling filenames with linebreaks would require very non-POSIX tools (like GNU sort -z) anyway.
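(A minimal sketch of that "safe ls" idea: query one file per ls call and take field 5, the size column in the POSIX ls -l output format, so spaces in the name never reach the field splitting. The function name is just illustrative.)
# Size via POSIX ls: -d stops directories from being expanded, -n prints
# numeric IDs so user names with spaces can't shift the columns;
# field 5 is the size in bytes.
size_of() { ls -dn -- "$1" | awk '{print $5}'; }
size_of "The W.pdf"    # -> 68239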
Bash+POSIX Approach Actually Comparing The Files
I would drop the approach of comparing only the file sizes and use cmp instead. For huge directories the POSIX script will be slow no matter what you do. Also, I expect cmp to do some fail-fast checks (like comparing the file sizes) before actually comparing the file contents. For common scenarios with only a few files, speed shouldn't matter anyway, as even the worst script will run fast enough.
The following script places each group of actual duplicates (at least two, but can be more) into its own subdirectory of dups/. The script should work with all filenames; spaces, special symbols, and even linebreaks are ok. Note that we are still using bash (which is not POSIX). We just assume that all tools (like mv, find, ...) are POSIX.
#! /usr/bin/env bash
files=()
for f in *; do [ -f "$f" ] && files+=("$f"); done
max=${#files[@]}
for (( i = 0; i < max; i++ )); do
    sameAsFileI=()
    for (( j = i + 1; j < max; j++ )); do
        cmp -s "${files[i]}" "${files[j]}" &&
            sameAsFileI+=("${files[j]}") &&
            unset 'files[j]'
    done
    (( ${#sameAsFileI[@]} == 0 )) && continue
    mkdir -p "dups/$i/"
    mv "${files[i]}" "${sameAsFileI[@]}" "dups/$i/"
    # no need to unset files[i] because loops won't visit this entry again
    files=("${files[@]}") # un-sparsify array
    max=${#files[@]}
done
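A possible way to run it (the script name here is made up): save it as, say, group-dups.sh, cd into the directory in question, and run bash group-dups.sh. Each group of identical files then ends up under dups/<i>/, where <i> is the loop index of the first file of that group.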
Fairly Portable Non-POSIX Approach Using File Sizes Only
If you need a faster approach that only compares the file sizes I suggest not using a nested loop. Loops in bash are slow already, but if you nest them you have quadratic time complexity. It is faster and easier to ...
- print only the file sizes without file names,
- apply sort | uniq -d to retrieve duplicates in time O(n log n),
- move all files having one of the duplicated sizes to a directory.
This solution is not strictly POSIX conformant. However, I tried to verify that the tools and options in this solution are supported by most implementations. Your find has to support the non-POSIX options -maxdepth and -printf with %s for the actual file size and %f for the file basename (%p for the full path would be acceptable too).
The following script places all files of the same size into the directory potential-dups/. If there are two files of size n and two files of size m, all four files end up in this single directory. The script should work with all file names except those with linebreaks (that is \n; \r should be fine though).
#! /usr/bin/env sh
all=$(find . -maxdepth 1 -type f -printf '%s %f\n' | sort)
dupRegex=$(printf %s\\n "$all" | cut -d' ' -f1 | uniq -d |
sed -e 's/[][\.|$(){}?+*^]/\\&/g' -e 's/^/^/' | tr '\n' '|' | sed 's/|$//')
[ -z "$dupRegex" ] && exit
mkdir -p potential-dups
printf %s\\n "$all" | grep -E "$dupRegex" | cut -d' ' -f2- |
sed 's/./\\&/' | xargs -I_ mv _ potential-dups
In case you wonder about some of the sed commands: They quote the file names such that spaces and special symbols are processed correctly by subsequent tools. sed 's/[][\.|$(){}?+*^]/\\&/g' is for turning raw strings into equivalent extended regular expressions (ERE) and sed 's/./\\&/' is for literal processing by xargs. See the POSIX documentation of xargs:
-I replstr
[...] Any <blank>s at the beginning of each line shall be ignored.
[...]
Note that the quoting rules used by xargs are not the same as in the shell. [...] An easy rule that can be used to transform any string into a quoted form that xargs interprets correctly is to precede each character in the string with a backslash.
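(A concrete illustration using one of the testbed names: sed 's/./\\&/' escapes just the first character of each line, which, going by the quoted passages, keeps xargs -I from stripping leading blanks and makes the start of the name safely literal.)
printf '%s\n' 'The W.pdf' | sed 's/./\\&/'      # -> \The W.pdf
printf '%s\n' ' odd name.pdf' | sed 's/./\\&/'  # -> \ odd name.pdf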
Source: https://stackoverflow.com/questions/61584817/how-do-i-find-duplicate-files-by-comparing-them-by-size-ie-not-hashing-in-bas