A very simple approach is the following:
Convert the image to greyscale in memory, so every pixel is just a single number between 0 (black) and 255 (white).
Scale the image to a fixed size. Finding the right size is important; play around with different sizes. For example, you could scale each image to 64x64 pixels, but you may get better or worse results with smaller or bigger pictures. A minimal sketch of these two steps follows below.
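To make this concrete, here is a minimal sketch of the greyscale-and-scale step using the Pillow library (the function name preprocess and the 64-pixel default are just my own choices, not part of the approach):

    from PIL import Image

    def preprocess(path, size=64):
        """Load an image, convert it to greyscale, scale it to size x size."""
        img = Image.open(path).convert("L")  # "L" mode: one 0..255 value per pixel
        return img.resize((size, size))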
Once you've done this for all images (yes, that will take a while), load two images into memory at a time and subtract them from each other: subtract the value of pixel (0,0) in image B from the value of pixel (0,0) in image A, then do the same for (0,1) in both, and so on. The resulting value might be positive or negative; always store the absolute value (so 5 stays 5, while -8 becomes 8).
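As a sketch, the pixel-wise subtraction could look like this (building on the hypothetical preprocess() above; Pillow's ImageChops.difference would do the same in one call):

    def delta_image(img_a, img_b):
        """Pixel-wise absolute difference of two equally sized greyscale images."""
        pixels_a = img_a.getdata()  # flat sequence of 0..255 greyscale values
        pixels_b = img_b.getdata()
        return [abs(a - b) for a, b in zip(pixels_a, pixels_b)]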
Now you have a third image, the "difference image" (delta image) of image A and B. If they were identical, the delta image is all black (all values subtract to zero). The "less black" it is, the less identical the images are. You need to find a good threshold, because even if the images are in fact identical (to your eyes), scaling, altered brightness and so on mean the delta image will never be totally black; it will just have very dark greytones. So you need a threshold that says "if the average error (delta image brightness) is below a certain value, there is still a good chance they are identical; if it is above that value, they are most likely not." Finding the right threshold is as hard as finding the right scaling size. You will always have false positives (images deemed identical, though they are not at all) and false negatives (images deemed not identical, although they are).
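Expressed in code, the threshold test might look like this; the default of 10 is an arbitrary starting point that you would have to tune as described above:

    def probably_identical(img_a, img_b, threshold=10):
        """Average delta brightness below the threshold -> likely duplicates."""
        delta = delta_image(img_a, img_b)
        mean_error = sum(delta) / len(delta)  # 0 means pixel-for-pixel identical
        return mean_error < threshold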
This algorithm is ultra slow. Just creating the greyscale images takes tons of time, and then you need to compare each greyscale image to every other one: for n images that is n*(n-1)/2 comparisons, again tons of time. Storing all the greyscale images also takes a lot of disk space. So this algorithm is very bad, but the results are better than I had initially expected, considering how simple it is.
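The quadratic cost is easy to see when you write the comparison down as a naive all-pairs loop (again just a sketch, reusing the hypothetical helpers above; caching the thumbnails in a dict trades the disk space mentioned above for memory):

    import itertools

    def find_duplicate_candidates(paths, threshold=10):
        """Naive all-pairs comparison: n*(n-1)/2 calls of probably_identical()."""
        thumbs = {p: preprocess(p) for p in paths}  # cache greyscale thumbnails
        return [(a, b)
                for a, b in itertools.combinations(paths, 2)
                if probably_identical(thumbs[a], thumbs[b], threshold)]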
The only way to get even better results is to use advanced image processing, and here it starts getting really complicated. It involves a lot of math (a whole lot of it); there are good applications (dupe finders) for many systems that have this implemented, so unless you must program it yourself, you are probably better off using one of those solutions. I read a lot of papers on this topic, but I'm afraid most of them go beyond my horizon. Even the algorithms I might be able to implement according to these papers are beyond it; that means I understand what needs to be done, but I have no idea why it works or how it actually works, it's just magic ;-)