I have a problem in my web crawler, which retrieves images from a particular website. The problem is that I often see images that are exactly the same but different in
I wrote a pure Java library for exactly this a few days ago. You give it a directory path (sub-directories are included), and it returns a list of the duplicate images, with absolute paths, that you may want to delete. Alternatively, you can use it to find all unique images in a directory.
It uses the AWT API internally, so it can't be used on Android. Since ImageIO has problems reading many newer image types, the library uses the TwelveMonkeys jar internally.
https://github.com/srch07/Duplicate-Image-Finder-API
A jar with its dependencies bundled can be downloaded from https://github.com/srch07/Duplicate-Image-Finder-API/blob/master/archives/duplicate_image_finder_1.0.jar
The API can also find duplicates among images of different sizes.
Depending on how detailed you want to get with it:
Regardless of whether you want to do all that, you need to:
There is no need to rely on any special imaging libraries; images are just bytes.
You could also generate an MD5 signature of each file and ignore duplicate entries. That won't help you find similar images, though.
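As a rough illustration of the MD5 idea, here is a minimal sketch that keeps a set of digests already seen and reports whether a downloaded image is new. The class name `Md5Deduplicator` and its methods are made up for this example; only `java.security.MessageDigest` and the collections classes come from the JDK. It assumes each image fits in memory as a `byte[]`.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: remembers the MD5 of every image seen so far.
class Md5Deduplicator {
    private final Set<String> seenDigests = new HashSet<>();

    // Returns the MD5 digest of the data as a lowercase hex string.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // True the first time these exact bytes are seen, false for repeats.
    boolean isNew(byte[] imageBytes) {
        return seenDigests.add(md5Hex(imageBytes));
    }

    public static void main(String[] args) {
        Md5Deduplicator dedup = new Md5Deduplicator();
        byte[] first = {1, 2, 3};
        byte[] copy = {1, 2, 3};
        System.out.println(dedup.isNew(first)); // first time these bytes appear
        System.out.println(dedup.isNew(copy));  // same bytes again
    }
}
```

In a crawler you would call `isNew` on each downloaded image and skip the ones that return `false`. This only catches files that are byte-for-byte identical; a re-encoded or resized copy produces a different digest.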
I would think you don't need an image library to do this - simply fetching the URL content and comparing the two streams as byte arrays should do it.
Unless of course you are interested in identifying similar images as well.
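The byte-array comparison could look something like the sketch below. The class `ImageByteCompare` and its method names are invented for illustration, and the URLs in `main` are placeholders, not real addresses from the question. It reads each stream fully into memory, so it assumes the images are reasonably small.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Arrays;

// Hypothetical helper: compares two images purely as byte arrays.
class ImageByteCompare {
    // Reads the whole stream into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Downloads the content behind a URL as raw bytes.
    static byte[] fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return readAll(in);
        }
    }

    // True only if both arrays have the same length and identical content.
    static boolean sameBytes(byte[] a, byte[] b) {
        return Arrays.equals(a, b);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URLs; pass two real image URLs on the command line.
        String first = args.length > 0 ? args[0] : "http://example.com/a.png";
        String second = args.length > 1 ? args[1] : "http://example.com/b.png";
        System.out.println(sameBytes(fetch(first), fetch(second)));
    }
}
```

Comparing every pair of images this way grows quadratically with the number of images, so for a large crawl it is usually cheaper to compare digests first and fall back to a full byte comparison only when the digests match.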