Identifying two identical images using Java

别那么骄傲 2020-12-28 21:10

I have a problem in my web crawler, where I am trying to retrieve images from a particular website. The problem is that I often see images that are exactly the same but different in

10 answers
  • 2020-12-28 21:33

    I wrote a pure Java library for exactly this a few days ago. You give it a directory path (subdirectories are included), and it returns a list of the absolute paths of the duplicate images, which you may want to delete. Alternatively, you can use it to find all unique images in a directory.

    It uses the AWT API internally, so it can't be used on Android. Since ImageIO has problems reading many newer image types, the library uses the TwelveMonkeys jar internally.

    https://github.com/srch07/Duplicate-Image-Finder-API

    A jar with its dependencies bundled can be downloaded from https://github.com/srch07/Duplicate-Image-Finder-API/blob/master/archives/duplicate_image_finder_1.0.jar

    The API can also find duplicates among images of different sizes.

  • 2020-12-28 21:37

    Depending on how detailed you want to get with it:

    • download the image
    • as you download it, generate a hash for it
    • make a directory whose name is the hash value (if the directory does not already exist)
    • if the directory contains 2 or more files, compare the file sizes
    • if the file sizes are the same, do a byte-by-byte comparison of the image against the images already in the directory
    • if the bytes differ from every existing file, you have a new image

    Regardless of whether you want to do all that, you need to:

    • download the images
    • do a byte-by-byte comparison of the images

    No need to rely on any special imaging libraries; images are just bytes.
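The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: it keeps the hash buckets in an in-memory map rather than in directories on disk, it picks SHA-256 as the hash (the answer does not name one), and the names `ImageStore`, `addIfNew` and `sha256Hex` are made up for the example.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: keeps one copy of each distinct image, bucketed by content hash. */
final class ImageStore {
    // Assumption: an in-memory map stands in for the "one directory per hash" layout.
    private final Map<String, List<byte[]>> buckets = new HashMap<>();

    /** Returns true if the image was new and has been stored, false if it is a duplicate. */
    boolean addIfNew(byte[] image) {
        List<byte[]> bucket =
                buckets.computeIfAbsent(sha256Hex(image), key -> new ArrayList<>());
        for (byte[] existing : bucket) {
            // Cheap size check first, then the byte-by-byte comparison.
            if (existing.length == image.length && Arrays.equals(existing, image)) {
                return false;
            }
        }
        bucket.add(image.clone());
        return true;
    }

    /** Hex-encoded SHA-256 of the given bytes (always 64 hex characters). */
    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return String.format("%064x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
        }
    }
}
```

A crawler would call `addIfNew` with the downloaded bytes of each image and only save the ones for which it returns `true`. The final byte comparison is what guards against two different images that happen to share a hash.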

  • 2020-12-28 21:41

    You could also generate an MD5 signature of the file and ignore duplicate entries. That won't help you find similar images, though.
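A minimal sketch of the MD5 idea, assuming the file content is already available as a byte array. The class and method names (`Md5Deduplicator`, `isFirstOccurrence`, `md5Hex`) are invented for illustration. Unlike a byte-by-byte check, this trusts the signature alone.

```java
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: remembers the MD5 signatures seen so far. */
final class Md5Deduplicator {
    private final Set<String> seenSignatures = new HashSet<>();

    /** Returns true the first time this exact content is seen, false for later duplicates. */
    boolean isFirstOccurrence(byte[] fileBytes) {
        // Set.add returns false when the signature is already present.
        return seenSignatures.add(md5Hex(fileBytes));
    }

    /** Hex-encoded MD5 signature of the given bytes. */
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            // %032x keeps leading zeros so the signature is always 32 hex characters.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }
}
```

Because only the signatures are kept, memory use stays small even for a large crawl. The trade-off is that a resized or re-encoded copy of an image produces a different signature and is not detected.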

  • 2020-12-28 21:41

    I would think you don't need an image library to do this - simply fetching the URL content and comparing the two streams as byte arrays should do it.

    Unless of course you are interested in identifying similar images as well.
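One way this might look, as a sketch only: each stream is read fully into a byte array and the arrays are compared. It assumes Java 9 or later (for `InputStream.readAllBytes`) and images small enough to hold in memory. `StreamComparer` and `sameContent` are names made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Arrays;

/** Hypothetical helper: compares the raw content behind two streams or URLs. */
final class StreamComparer {

    /** Reads both streams to the end and reports whether their bytes are identical. */
    static boolean sameContent(InputStream first, InputStream second) throws IOException {
        byte[] a = first.readAllBytes();   // Java 9+
        byte[] b = second.readAllBytes();
        return Arrays.equals(a, b);
    }

    /** Fetches both URLs and compares what they return, closing the streams afterwards. */
    static boolean sameContent(URL first, URL second) throws IOException {
        try (InputStream a = first.openStream();
             InputStream b = second.openStream()) {
            return sameContent(a, b);
        }
    }
}
```

This only detects exact copies. Two files that show the same picture but differ in metadata or encoding would compare as different.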
