How to find duplicated jpgs by content?

Question

I'd like to find and remove every copy of an image across a series of folders. The problem is that the copies do not necessarily have the same file name.

What I did was copy an arbitrary string from the image's raw bytes and grep for it, like this:

grep -ir 'YA'uu�KU���^H2�Q�W^YSp��.�^H^\^Q��P^T' .

But since there are thousands of images, this method takes forever. Also, some of the images were created from the original with ImageMagick, so I cannot use file size to find them all.

So I'm wondering: what is the most efficient way to do this?

Answer 1:

Updated Answer

If you already know the checksum of the specific file you are looking for, you can checksum every file in all subdirectories and keep only the ones that match:

find . -name '*.jpg' -exec bash -c 'echo "$(md5 < "$1") $1"' _ {} \; | grep "94b48ea6e8ca3df05b9b66c0208d5184"

Or this may work for you too:

find . -name \*.jpg -exec md5 {} \; | grep "94b48ea6e8ca3df05b9b66c0208d5184"
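To get the checksum of the known image and run the search in one step, something like the following sketch may help. The function names `md5_of` and `find_copies_of` are hypothetical, and the fallback to `md5sum` is an assumption for Linux systems where the `md5` command is not installed. It also assumes file names contain no newline characters.

```shell
# Print only the MD5 hash of one file, using whichever tool is installed.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum < "$1" | cut -d' ' -f1     # GNU coreutils prints "hash  -"
    else
        md5 < "$1"                        # BSD/macOS md5 prints just the hash
    fi
}

# Usage: find_copies_of REFERENCE_FILE [DIRECTORY]
# Prints the path of every .jpg under DIRECTORY (default: .) whose bytes
# are identical to REFERENCE_FILE. Runs in a subshell so variables stay local.
find_copies_of() (
    want=$(md5_of "$1")
    find "${2:-.}" -type f -iname '*.jpg' | while IFS= read -r f; do
        if [ "$(md5_of "$f")" = "$want" ]; then
            printf '%s\n' "$f"
        fi
    done
)
```

For example, `find_copies_of ./a.jpg .` lists every matching file below the current directory. The reference file itself is listed too if it lives under the searched directory.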

Original Answer

The easiest way is to generate an MD5 checksum once for each file. Depending on how your md5 program works (the command is md5 on macOS and BSD, and md5sum on most Linux systems), you would do something like this:

find . -name '*.jpg' -exec bash -c 'echo "$(md5 < "$1") $1"' _ {} \;

94b48ea6e8ca3df05b9b66c0208d5184 ./a.jpg
f0361a81cfbe9e4194090b2f46db5dad ./b.jpg
c7e4f278095f40a5705739da65532739 ./c.jpg

Or, if your md5 supports the -r option (which prints the checksum before the file name), you may be able to use

md5 -r *.jpg
94b48ea6e8ca3df05b9b66c0208d5184 a.jpg
f0361a81cfbe9e4194090b2f46db5dad b.jpg
c7e4f278095f40a5705739da65532739 c.jpg

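Now you can sort this list and use uniq to find all duplicates. uniq only detects repeated lines that are adjacent, so the sort is required, and the comparison must be limited to the checksum because the file names differ. With GNU uniq, -w 32 compares only the first 32 characters and -D prints every duplicated line.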
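Since the BSD/macOS uniq does not have the -w option, a more portable way to do the grouping is with awk. The following is only a sketch under a few assumptions: `find_duplicates` is a hypothetical helper name, `md5sum` is used when available with `md5` as the fallback, and file names are assumed not to contain newline characters.

```shell
# Sketch: print "checksum path" for every .jpg under a directory (default: .)
# whose content is shared with at least one other file.
find_duplicates() {
    find "${1:-.}" -type f -iname '*.jpg' -exec sh -c '
        for f do
            # Hash the raw bytes; read from stdin so no file name is printed.
            if command -v md5sum >/dev/null 2>&1; then
                sum=$(md5sum < "$f" | cut -d" " -f1)
            else
                sum=$(md5 < "$f")
            fi
            printf "%s %s\n" "$sum" "$f"
        done' sh {} + |
    awk '{
            sum = substr($0, 1, 32)         # an MD5 digest is 32 hex characters
            count[sum]++
            lines[sum] = lines[sum] $0 "\n"
        }
        END {
            # Keep only checksums that occurred more than once.
            for (sum in count)
                if (count[sum] > 1) printf "%s", lines[sum]
        }' |
    sort
}
```

Calling `find_duplicates .` prints the duplicated files sorted by checksum, so files with identical content appear on consecutive lines. Review the list before deleting anything, and keep one file from each group.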

Source: https://stackoverflow.com/questions/34747987/how-to-find-duplicated-jpgs-by-content

Tags

image

image-processing