How can I recognize slightly modified images?

Submitted by 前提是你 on 2019-11-28 17:34:41

Have a look at O. Chum, J. Philbin, and A. Zisserman, Near duplicate image detection: min-hash and tf-idf weighting, in Proceedings of the British Machine Vision Conference, 2008. They solve the problem you have and demonstrate the results for 146k images. However, I have no first-hand experience with their approach.

Naive idea: create a small thumbnail (50x50 pixels) to find "probably identical" images, then increase thumbnail size to discard more images.
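A minimal sketch of that coarse-to-fine filter, assuming square grayscale images stored as lists of rows (values 0-255) whose side length is a multiple of every thumbnail size. The function names, the `tolerance` threshold and the sizes are made up for illustration; a real version would use an image library and thumbnails nearer the 50x50 mentioned above.

```python
def thumbnail(pixels, size):
    """Downscale a square grayscale image to size x size by block averaging."""
    n = len(pixels)
    block = n // size  # assumes n is a multiple of size
    thumb = []
    for ty in range(size):
        row = []
        for tx in range(size):
            total = 0
            for y in range(ty * block, (ty + 1) * block):
                for x in range(tx * block, (tx + 1) * block):
                    total += pixels[y][x]
            row.append(total // (block * block))
        thumb.append(row)
    return thumb


def probably_identical(a, b, size, tolerance=4):
    """True if every thumbnail pixel differs by at most `tolerance` (hypothetical threshold)."""
    ta, tb = thumbnail(a, size), thumbnail(b, size)
    return all(
        abs(pa - pb) <= tolerance
        for ra, rb in zip(ta, tb)
        for pa, pb in zip(ra, rb)
    )


def find_candidates(query, database, sizes=(2, 4, 8)):
    """Keep only the images that still match the query at every thumbnail size."""
    candidates = list(database)  # database maps image name -> pixels
    for size in sizes:
        candidates = [name for name in candidates
                      if probably_identical(query, database[name], size)]
    return candidates
```

The smallest thumbnails are cheap to compare and throw out most of the database, so the larger, more expensive comparisons only run on the few survivors.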

Building on the idea of minHash...

My idea is to build 100 look-up tables from all the images currently in the database. Each look-up table maps the brightness of one particular pixel to the list of images that have that same brightness at that same pixel. To search for an image, run it through the tables, get 100 lists, and score one point for an image each time it shows up in a list. Each image ends up with a score from 0 to 100. The image with the most points wins.

There are many issues with how to do this within reasonable memory constraints and how to do it quickly. Proper data structures are needed for storage on disk. Tweaking of the hashing value, number of tables, etc, is possible, too. If more information is needed, I can expand on this.
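Here is a rough in-memory sketch of the table scheme, to make the scoring concrete. It assumes images are flat lists of brightness values of equal length; the class name, the random choice of pixel positions and the `bucket` parameter are my own assumptions, and none of the disk storage concerns are handled.

```python
import random
from collections import defaultdict


class PixelTables:
    """One look-up table per sampled pixel position: brightness -> set of image ids."""

    def __init__(self, num_pixels, num_tables=100, bucket=1, seed=0):
        rng = random.Random(seed)
        # Assumption: the pixel positions are sampled at random, once, up front.
        self.positions = rng.sample(range(num_pixels), num_tables)
        # Assumption: bucket > 1 merges nearby brightness values into one key.
        self.bucket = bucket
        self.tables = [defaultdict(set) for _ in self.positions]

    def _key(self, pixels, pos):
        return pixels[pos] // self.bucket

    def add(self, image_id, pixels):
        for table, pos in zip(self.tables, self.positions):
            table[self._key(pixels, pos)].add(image_id)

    def search(self, pixels):
        """Return (image_id, score) pairs, best first; scores range up to num_tables."""
        scores = defaultdict(int)
        for table, pos in zip(self.tables, self.positions):
            for image_id in table.get(self._key(pixels, pos), ()):
                scores[image_id] += 1
        return sorted(scores.items(), key=lambda item: -item[1])
```

Images that never share a list with the query simply do not appear in the result, which is what keeps the search cheap.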

My results have been very good. I'm able to index one million images in about 24 hours on one computer, and I can look up 20 images per second. Accuracy is astounding as far as I can tell.

I don't think this problem can be solved by hashing. Here's the difficulty: suppose you look at the red component of a pixel, and you want the values 3 and 5 to hash to the same value because they are nearly the same color. Well, then you also want 5 and 7 to hash to the same value, and 7 and 9, and so on... you can construct a chain that says you want all pixel values to hash to the same value.
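The chain argument can be checked mechanically. The sketch below (my own illustration, with a made-up function name) merges every pair of values that lie within a given tolerance of each other and counts how many distinct hash classes remain.

```python
def count_hash_classes(max_value=255, tolerance=2):
    """Count the classes left when all values within `tolerance` must collide."""
    parent = list(range(max_value + 1))

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for v in range(max_value + 1):
        for d in range(1, tolerance + 1):
            if v + d <= max_value:
                parent[find(v + d)] = find(v)  # "v and v + d must hash alike"
    return len({find(v) for v in range(max_value + 1)})
```

With any tolerance of 1 or more, the count collapses to a single class, which is the problem described above; only a tolerance of 0 keeps all 256 values apart.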

Here's what I would try instead:

  1. Build a huge B-tree, with 32-way fanout at each node, containing all of the images.
  2. All images in the tree are the same size, or they're not duplicates.
  3. Give each colored pixel a unique number starting at zero. Upper left might be numbered 0, 1, 2 for the R, G, B components, or you might be better off with a random permutation, because you're going to compare images in order of that numbering.
  4. An internal node at depth n discriminates 32 ways on the value of pixel n divided by 8 (this gets rid of some of the noise in nearby pixel values).
  5. A leaf node contains some small number of images, let's say 10 to 100. Or maybe the number of images is an increasing function of depth, so that if you have 500 duplicates of one image, after a certain depth you stop trying to distinguish them.
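The steps above could be sketched roughly as follows. This is an in-memory toy, not a real B-tree: it assumes every image is a flat list of component values of the same length, and `LEAF_CAPACITY`, `Node`, `insert` and `lookup` are names I made up for the illustration.

```python
class Node:
    """Tree node: a leaf holds images; an internal node branches on one pixel value."""

    def __init__(self, depth):
        self.depth = depth
        self.images = []      # (image_id, pixels) pairs; used only while this is a leaf
        self.children = None  # dict: bucket -> Node, once the leaf has been split


LEAF_CAPACITY = 4  # hypothetical; the text suggests 10 to 100


def bucket(pixels, depth):
    return pixels[depth] // 8  # 256 values / 8 = 32 outedges


def insert(node, image_id, pixels):
    # Walk down to the leaf this image belongs to, creating children as needed.
    while node.children is not None:
        node = node.children.setdefault(bucket(pixels, node.depth), Node(node.depth + 1))
    node.images.append((image_id, pixels))
    # Split an overfull leaf, unless there are no pixels left to discriminate on.
    if len(node.images) > LEAF_CAPACITY and node.depth < len(pixels):
        pending, node.images, node.children = node.images, [], {}
        for other_id, other_pixels in pending:
            insert(node, other_id, other_pixels)


def lookup(node, pixels):
    """Return the ids of the images stored in the leaf that `pixels` falls into."""
    while node.children is not None:
        node = node.children.get(bucket(pixels, node.depth))
        if node is None:
            return []
    return [image_id for image_id, _ in node.images]
```

A leaf that fills up with exact duplicates stops splitting once every pixel has been used, which is one simple way to get the "stop trying to distinguish them" behavior from step 5.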

Once all two million images are inserted in the tree, two images are duplicates only if they're at the same node. Right? Wrong! If the pixel values in two images are 127 and 128, one goes into outedge 15 and the other goes into outedge 16. So actually when you discriminate on a pixel, you may insert that image into one or two children:

  • For brightness B, insert at B/8, (B-3)/8, and (B+3)/8. Sometimes all 3 will be equal, and always 2 of 3 will be equal. But with probability 6/8 (B-3 falls into the lower bucket 3 times out of 8, and B+3 into the upper bucket another 3 times out of 8), you double the number of outedges on which the image appears. Depending on how deep things go you could have lots of extra nodes.
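As a small sketch of that rule, the function below returns the outedges an image would be inserted on for one pixel. Clamping to the range 0..255 is my assumption, since the rule as stated does not say what happens at the ends of the range; the function and parameter names are also made up.

```python
def outedges(brightness, divisor=8, slack=3, max_value=255):
    """Buckets to insert into for one pixel: B/8, (B-3)/8 and (B+3)/8, deduplicated."""
    candidates = (brightness - slack, brightness, brightness + slack)
    # Assumption: clamp to 0..max_value so the range ends stay in valid buckets.
    clamped = (min(max(value, 0), max_value) for value in candidates)
    return sorted({value // divisor for value in clamped})
```

With this rule, `outedges(127)` and `outedges(128)` both give `[15, 16]`, so the two near-identical pixels from the example meet again in the same children.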

Someone else will have to do the math and see if you have to divide by something larger than 8 to keep images from being duplicated too much. The good news is that even if the true fanout is only around 4 instead of 32, you only need a tree of depth 10. If an image gets doubled four times on its way down those 10 levels, two million images become 32 million entries at the leaves. I hope you have plenty of RAM at your disposal! If not, you can put the tree in the filesystem.

Let me know how this goes!

Another good thing about hashing thumbnails: scaled duplicates are recognized as well (with little modification).
