How to find duplicated jpgs by content?

久未见 submitted on 2021-02-11 12:40:24

Question


I'd like to find and remove every copy of an image across a series of folders. The problem is that the copies do not necessarily share the same file name.

What I did was copy an arbitrary string from the image's bytes and search for it like this:

grep -ir 'YA'uu�KU���^H2�Q�W^YSp��.�^H^\^Q��P^T' .

But since there are thousands of images, this method takes forever. Also, some of the images were created from the original with ImageMagick, so I cannot use file size to find them all.

So what is the most efficient way to do this?


Answer 1:


Updated Answer

If you already have the checksum of a specific file that you want to compare against, you can checksum all files in all subdirectories and find the ones that match:

find . -name \*.jpg -exec bash -c 's=$(md5 < "$1"); echo "$s $1"' _ {} \; | grep "94b48ea6e8ca3df05b9b66c0208d5184"

Or this may work for you too:

find . -name \*.jpg -exec md5 {} \; | grep "94b48ea6e8ca3df05b9b66c0208d5184"
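If you have the reference image itself rather than its checksum, the same idea can be wrapped in a small helper. This is only a sketch: the function names `file_md5` and `find_copies` are made up for illustration, and it assumes either `md5sum` (typical on Linux) or `md5` (macOS/BSD) is installed.

```shell
# Print only the MD5 hex digest of the file given as $1.
# Uses md5sum when available, otherwise falls back to BSD/macOS md5.
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum < "$1" | awk '{print $1}'
    else
        md5 -q "$1"
    fi
}

# List every .jpg under directory $2 whose bytes match reference file $1.
# File names containing newlines are not handled by this sketch.
find_copies() {
    ref_sum=$(file_md5 "$1") || return 1
    find "$2" -type f -name '*.jpg' | while IFS= read -r f; do
        if [ "$(file_md5 "$f")" = "$ref_sum" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Example call (hypothetical paths):
#   find_copies ./a.jpg .
```

Only byte-identical files are reported, so an image that has been re-encoded or resized will have a different checksum and will not show up.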

Original Answer

The easiest way is to generate an md5 checksum once for each file. Depending on how your md5 program works (on Linux the equivalent tool is usually md5sum), you would do something like this:

find . -name \*.jpg -exec bash -c 's=$(md5 < "$1"); echo "$s $1"' _ {} \;

94b48ea6e8ca3df05b9b66c0208d5184 ./a.jpg
f0361a81cfbe9e4194090b2f46db5dad ./b.jpg
c7e4f278095f40a5705739da65532739 ./c.jpg

Or maybe you can use

md5 -r *.jpg
94b48ea6e8ca3df05b9b66c0208d5184 a.jpg
f0361a81cfbe9e4194090b2f46db5dad b.jpg
c7e4f278095f40a5705739da65532739 c.jpg

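Now you can sort this output by checksum and use uniq to find all duplicates; uniq only compares adjacent lines, so the sorting step is required.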
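As one possible way to do that last step, here is a sketch that keeps only the lines whose checksum appears more than once. The function name `list_duplicates` is made up for illustration, and awk is used in place of uniq because the uniq options for comparing just the first N characters are GNU-specific.

```shell
# Read "checksum path" lines on stdin and print only the lines whose
# checksum occurs more than once, grouped together by checksum.
list_duplicates() {
    LC_ALL=C sort | awk '
        NR > 1 && $1 == prev {
            # Same checksum as the previous line: print the first member
            # of the group once, then the current line.
            if (!shown) print held
            print
            shown = 1
            next
        }
        { prev = $1; held = $0; shown = 0 }
    '
}

# Example call (use md5sum instead of "md5 -r" on Linux):
#   find . -name '*.jpg' -exec md5 -r {} + | list_duplicates
```

Each group in the output is a set of byte-identical files, so you can keep one file per group and remove the rest.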



Source: https://stackoverflow.com/questions/34747987/how-to-find-duplicated-jpgs-by-content
