Identifying two identical images using Java

别那么骄傲 2020-12-28 21:10

I have a problem in my web crawler where I am trying to retrieve images from a particular website. The problem is that I often see images that are exactly the same but different in

10 Answers
  • 2020-12-28 21:16

    I've done something very similar to this in Java before, and I found that the PixelGrabber class in the java.awt.image package of the API is extremely helpful (if not downright necessary).
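
    For example, a minimal sketch of a PixelGrabber-based comparison (my illustration, assuming java.awt.Image, java.awt.image.PixelGrabber and java.util.Arrays, and that both images have already been loaded and share the same width and height):

    static boolean samePixels(Image img1, Image img2, int w, int h) throws InterruptedException {
        int[] pixels1 = new int[w * h];
        int[] pixels2 = new int[w * h];
        new PixelGrabber(img1, 0, 0, w, h, pixels1, 0, w).grabPixels();   // fills pixels1 with packed ARGB values
        new PixelGrabber(img2, 0, 0, w, h, pixels2, 0, w).grabPixels();
        return Arrays.equals(pixels1, pixels2);                           // identical only if every pixel matches
    }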

    Additionally, you will definitely want to check out the ColorConvertOp class, which performs a pixel-by-pixel color conversion of the data in the source image, with the resulting color values scaled to the precision of the destination image. The documentation goes on to say that the source and destination can even be the same image, in which case it would be quite simple to detect whether they are identical.
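
    For illustration, a grayscale conversion with ColorConvertOp might look like this (a sketch, assuming java.awt.color.ColorSpace and java.awt.image imports, and that source is a BufferedImage you have already loaded):

    ColorConvertOp toGrey = new ColorConvertOp(ColorSpace.getInstance(ColorSpace.CS_GRAY), null);
    BufferedImage grey = toGrey.filter(source, null);   // null destination: a new grey image is created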

    If you are detecting similarity rather than exact equality, you need to use some form of averaging method, as mentioned in the answer to this question.

    If you can, also check out Volume 2, Chapter 7 of Horstmann's Core Java (8th ed.), because it has a whole bunch of examples on image transformations and the like. But again, make sure to poke around the java.awt.image package, because you should find that almost everything has been prepared for you :)

    G'luck!

  • 2020-12-28 21:18

    Calculate an MD5 for each image's raw bytes using something like this:

    // imageBytes: the raw bytes of the downloaded image (a byte[], not a String)
    MessageDigest m = MessageDigest.getInstance("MD5");
    m.update(imageBytes, 0, imageBytes.length);
    System.out.println("MD5: " + new BigInteger(1, m.digest()).toString(16));
    

    Put the resulting hashes in a HashMap; if a hash is already present, you have seen that image before.
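
    A minimal sketch of that bookkeeping (md5Hex is a hypothetical helper wrapping the digest code above, and url is whatever identifier your crawler uses):

    Map<String, String> seenHashes = new HashMap<>();            // hex MD5 -> first URL seen with it

    boolean isDuplicate(byte[] imageBytes, String url) {
        String hash = md5Hex(imageBytes);                        // assumed helper: hex MD5 of the raw bytes
        String previousUrl = seenHashes.putIfAbsent(hash, url);  // remember the first URL per hash
        return previousUrl != null;                              // non-null means this image was seen before
    }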

  • 2020-12-28 21:22

    Inspect the response headers and interrogate the HTTP ETag header value, if present (RFC 2616: ETag). It may be the same for identical images coming from your target web server, because the ETag value is often a message digest such as MD5, which lets you take advantage of computations the web server has already completed.

    This may potentially allow you to not even download the image!

    for each imageUrl in myList
        Perform HTTP HEAD imageUrl
        Pull the ETag value from the response
        If ETag is in my map of known ETags
           move on to next image
        Else
           Download image
           Store ETag in map
    

    Of course, the ETag has to be present, and if it is not, well, the idea is toast. But maybe you have some pull with the web server admins?
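
    A minimal Java sketch of the pseudocode above, assuming java.net.HttpURLConnection, a java.util.HashSet, that myList is a List<URL>, and that downloadImage stands in for your existing download code:

    Set<String> knownEtags = new HashSet<>();
    for (URL imageUrl : myList) {
        HttpURLConnection conn = (HttpURLConnection) imageUrl.openConnection();
        conn.setRequestMethod("HEAD");                   // fetch headers only, no body
        String etag = conn.getHeaderField("ETag");
        conn.disconnect();
        if (etag != null && !knownEtags.add(etag)) {
            continue;                                    // ETag already seen: skip this image
        }
        downloadImage(imageUrl);                         // hypothetical: your existing download routine
    }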

  • 2020-12-28 21:27

    You can compare images using:

    1) Simple pixel-by-pixel comparison (see the sketch after this list)

    It will not give very good results when there is any shift, rotation, illumination change, etc.

    2) Relatively simple but more advanced approach

    http://www.lac.inpe.br/JIPCookbook/6050-howto-compareimages.jsp

    3) More advanced algorithms

    For example, RapidMiner with the IMMI extension contains several image comparison algorithms; you can experiment with different approaches and select whichever suits your purpose best.
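
    A minimal sketch of option 1, the exact pixel-by-pixel comparison (assuming both images are BufferedImages, e.g. loaded with ImageIO.read):

    static boolean pixelsEqual(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            return false;                                // different dimensions: not identical
        }
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (a.getRGB(x, y) != b.getRGB(x, y)) {
                    return false;                        // first differing pixel ends the check
                }
            }
        }
        return true;
    }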

  • 2020-12-28 21:27

    Hashing has already been suggested, and recognizing whether two files are identical is very easy, but you mentioned the pixel level. If you want to recognize two images even when they are in different formats (.png/.jpg/.gif/...) and even when they have been scaled, I suggest the following (using an image library; this assumes the images are medium/large, not 16x16 icons):

    1. Scale the image to some fixed size (the size depends on your samples).
    2. Transform it to grey scale, for example by applying the RGB-to-YUV conversion and taking the Y channel (very easy).
    3. Compute the Hamming distance between the two grey images and set a threshold to decide whether they are the same or not.

    In other words, sum the differences of all the grey pixels of both images; that gives you a number, and if the difference is < T you consider both images identical.
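
    A rough sketch of this recipe (the 64x64 size and the threshold T are illustrative values, not from the answer, and Java's built-in TYPE_BYTE_GRAY conversion stands in for the explicit RGB-to-YUV step):

    static boolean looksIdentical(BufferedImage a, BufferedImage b) {
        final int SIZE = 64;                             // step 1: assumed fixed size
        BufferedImage ga = toGrey(a, SIZE), gb = toGrey(b, SIZE);
        long diff = 0;
        for (int y = 0; y < SIZE; y++) {
            for (int x = 0; x < SIZE; x++) {
                diff += Math.abs((ga.getRGB(x, y) & 0xFF) - (gb.getRGB(x, y) & 0xFF));
            }
        }
        return diff < 100_000;                           // step 3: T, tune against your own samples
    }

    static BufferedImage toGrey(BufferedImage src, int size) {
        // Steps 1 and 2 in one go: scale down onto a grey-scale image.
        BufferedImage out = new BufferedImage(size, size, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = out.createGraphics();
        g.drawImage(src, 0, 0, size, size, null);
        g.dispose();
        return out;
    }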

    --

  • 2020-12-28 21:33

    Look at the MessageDigest class. Essentially, you create an instance of it, then pass it a series of bytes. The bytes could be the bytes directly loaded from the URL if you know that two images that are the "same" will be the selfsame file/stream of bytes. Or if necessary, you could create a BufferedImage from the stream, then pull out pixel values, something like:

      // bimg: the BufferedImage decoded from the downloaded data
      MessageDigest md = MessageDigest.getInstance("MD5");
      ByteBuffer bb = ByteBuffer.allocate(4 * bimg.getWidth());   // holds one row of packed ARGB ints
      for (int y = bimg.getHeight()-1; y >= 0; y--) {
        bb.clear();
        for (int x = bimg.getWidth()-1; x >= 0; x--) {
          bb.putInt(bimg.getRGB(x, y));
        }
        md.update(bb.array());                                    // feed the row into the digest
      }
      byte[] digBytes = md.digest();
    

    Either way, MessageDigest.digest() eventually gives you a byte array which is the "signature" of the image. You could convert this to a hex string if that's helpful, e.g. for putting it in a HashMap or a database table:

    StringBuilder sb = new StringBuilder();
    for (byte b : digBytes) {
      sb.append(String.format("%02X", b & 0xff));
    }
    String signature = sb.toString();
    

    If the content/image from two URLs gives you the same signature, then they're the same image.

    Edit: I forgot to mention that if you are hashing pixel values, you would probably want to include the dimensions of the image in the hash too. (Just do a similar thing: write the two ints to an 8-byte ByteBuffer, then update the MessageDigest with the corresponding 8-byte array.)
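
    For example, something along these lines (reusing md and bimg from the snippet above):

    ByteBuffer dims = ByteBuffer.allocate(8);              // two ints: width and height
    dims.putInt(bimg.getWidth()).putInt(bimg.getHeight());
    md.update(dims.array());                               // fold the dimensions into the digest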

    The other thing somebody mentioned is that MD5 is not collision-resistant. In other words, there is a technique for constructing multiple byte sequences with the same MD5 hash without having to use the "brute force" method of trial and error (where, on average, you would expect to try about 2^64, roughly 18 billion billion, files before hitting a collision). That makes MD5 unsuitable where you are trying to protect against this threat model. If you are not concerned about somebody deliberately trying to fool your duplicate identification, and you are only worried about the chance of a duplicate hash "by chance", then MD5 is absolutely fine. Actually, it is not only fine, it is a bit over the top: as I say, on average you would expect one "false duplicate" after roughly 18 billion billion files. Put another way, you could have, say, a billion files and the chance of a collision would still be extremely close to zero.

    If you are worried about the threat model outlined (i.e. you think somebody could deliberately dedicate processor time to constructing files that fool your system), the solution is to use a stronger hash. Java supports SHA-1 out of the box (just replace "MD5" with "SHA-1"), which gives you longer hashes (160 bits instead of 128 bits). Note, though, that practical SHA-1 collisions have since been demonstrated as well, so SHA-256 is the safer choice if a deliberate attacker really is part of your threat model.

    Personally for this purpose, I would even consider just using a decent 64-bit hash function. That'll still allow tens of millions of images to be compared with close-to-zero chance of a false positive.
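
    A simple 64-bit FNV-1a hash is one possible candidate (this particular function is my illustration, not something named above):

    static long fnv1a64(byte[] data) {
        long h = 0xcbf29ce484222325L;                    // FNV-1a 64-bit offset basis
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;                         // FNV-1a 64-bit prime
        }
        return h;
    }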
