I have about 5000 images with watermarks on them and 5000 otherwise identical images without watermarks. The file names in the two sets are not correlated with each other in any way.
I think this is more about performance than about the image comparison itself, and the answer is written with that in mind. If you need help with the comparison itself, leave a comment.
1. Create a simplified histogram for each image.
   Say, 8 bins per channel, with each bin's value limited to 4 bits. That gives 3*8*4 = 3*32 = 96 bits per image.
2. Sort the images.
   Treat the histogram above as a single number and sort the images of group A (original) by it. It does not matter whether the order is ascending or descending.
3. Match the images of groups A and B.
   Corresponding images should now have similar histograms, so take an image from the unsorted group B (watermarked), binary-search for the closest matches in group A (original), and then compare it with more robust methods against just those selected images instead of all 5000.
4. Flag each image from group A once it has been matched,
   so you can ignore already matched images in bullet #3 to gain more speed.
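The steps above could be sketched roughly as follows in Java, using only the standard library. The class name `HistogramMatcher`, the `radius` search window, and the way bin counts are scaled down to 4 bits are assumptions made for illustration; they are not spelled out in the answer. Keys that are numerically close mainly agree in their most significant bins, so the returned candidates still need the more robust comparison.

```java
import java.awt.image.BufferedImage;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class HistogramMatcher {
    static final int BINS = 8;    // bins per colour channel
    static final int LEVELS = 16; // 4 bits per bin

    /** Packs a 3-channel, 8-bin histogram (4 bits per bin) into one 96-bit number. */
    static BigInteger key(BufferedImage img) {
        long[][] counts = new long[3][BINS];
        int w = img.getWidth(), h = img.getHeight();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                counts[0][((rgb >> 16) & 0xFF) >> 5]++; // red: 256 / 8 = 32 values per bin
                counts[1][((rgb >> 8) & 0xFF) >> 5]++;  // green
                counts[2][(rgb & 0xFF) >> 5]++;         // blue
            }
        }
        long total = (long) w * h;
        BigInteger key = BigInteger.ZERO;
        for (long[] channel : counts) {
            for (long c : channel) {
                // Assumed scaling: the bin's share of all pixels mapped to 0..15.
                int level = (int) (c * (LEVELS - 1) / total);
                key = key.shiftLeft(4).or(BigInteger.valueOf(level));
            }
        }
        return key;
    }

    /**
     * Binary-searches the sorted keys of group A and returns the indices of
     * the unmatched entries within `radius` positions of the insertion point.
     * `matched` is indexed in the same (sorted) order as `sortedKeys`.
     */
    static List<Integer> candidates(BigInteger[] sortedKeys, BigInteger target,
                                    int radius, boolean[] matched) {
        int pos = Arrays.binarySearch(sortedKeys, target);
        if (pos < 0) {
            pos = -pos - 1; // not found: convert to insertion point
        }
        int from = Math.max(0, pos - radius);
        int to = Math.min(sortedKeys.length, pos + radius + 1);
        List<Integer> result = new ArrayList<>();
        for (int i = from; i < to; i++) {
            if (!matched[i]) {
                result.add(i);
            }
        }
        return result;
    }
}
```

In use, one would compute `key` for every image in group A, sort those keys together with their file names, and then call `candidates` once per image of group B. After the robust comparison confirms a pair, setting `matched[i] = true` for the chosen index implements the flag from the last step.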
[Notes]
There are other ways to improve this, such as using perceptual hash algorithms.
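As one possible illustration of a perceptual hash, here is a minimal sketch of an "average hash" in plain Java. The choice of average hash, the 8x8 grid, and the class name `AverageHash` are assumptions for the example; the note above does not name a specific algorithm.

```java
import java.awt.image.BufferedImage;

final class AverageHash {
    /** 64-bit hash: one bit per cell of an 8x8 grid, set when the cell is brighter than the mean. */
    static long hash(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        double[] cell = new double[64];
        double mean = 0;
        for (int cy = 0; cy < 8; cy++) {
            for (int cx = 0; cx < 8; cx++) {
                // Pixel range covered by this grid cell (at least one pixel wide and tall).
                int x0 = cx * w / 8, x1 = Math.max(x0 + 1, (cx + 1) * w / 8);
                int y0 = cy * h / 8, y1 = Math.max(y0 + 1, (cy + 1) * h / 8);
                double sum = 0;
                for (int y = y0; y < y1; y++) {
                    for (int x = x0; x < x1; x++) {
                        int rgb = img.getRGB(x, y);
                        // Simple grey value: average of the three channels.
                        sum += (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3.0;
                    }
                }
                cell[cy * 8 + cx] = sum / ((x1 - x0) * (y1 - y0));
                mean += cell[cy * 8 + cx] / 64;
            }
        }
        long bits = 0;
        for (int i = 0; i < 64; i++) {
            if (cell[i] > mean) {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    /** Hamming distance between two hashes; small values suggest similar images. */
    static int distance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```

A watermarked image and its original would be expected to differ in only a few bits, so a small `distance` threshold (to be tuned on real data) could serve as the "more robust" check on the candidates.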
You can use the OpenCV library, which can be used from Java. See the tutorial at http://docs.opencv.org/doc/tutorials/introduction/desktop_java/java_dev_intro.html
For the image comparison itself, there is another useful answer here: Checking images for similarity with OpenCV