Simple hash of PIL image

醉话见心 2020-12-31 21:41

Background

I want to store information about PIL images in a key-value store. To do that, I hash the image and use the hash as the key.

What I tried

I h

2 Answers
  • 2020-12-31 21:58

    Recognising what you say about timestamps, ImageMagick has exactly such a feature. First, an example.

    Here I create two images with identical pixels but with timestamps at least 1 second apart:

    convert -size 600x100 gradient:magenta-cyan 1.png
    sleep 2
    convert -size 600x100 gradient:magenta-cyan 2.png
    

    If I checksum them on macOS, it tells me they are different because of the embedded timestamp:

    md5 -r [12].png
    
    c7454aa225e3e368abeb5290b1d7a080 1.png
    66cb4de0b315505de528fb338779d983 2.png
    

    But if I checksum just the pixels with ImageMagick (where %# is the pixel-wise checksum), it knows the pixels are identical and I get:

    identify -format '%# - %f\n' 1.png 2.png
    70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 1.png
    70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 2.png
    

    And, in fact, if I make a TIFF file with the same image contents, whether with Motorola or Intel byte order, or a NetPBM PPM file:

    convert -size 600x100 gradient:magenta-cyan -define tiff:endian=msb 3motorola.tif
    convert -size 600x100 gradient:magenta-cyan -define tiff:endian=lsb 3intel.tif
    convert -size 600x100 gradient:magenta-cyan 3.ppm
    

    ImageMagick knows they are the same, despite the different file formats, byte orders and timestamps:

    identify -format '%# - %f\n' 1.png 3.ppm 3{motorola,intel}.tif
    
    70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 1.png
    70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 3.ppm
    70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 3motorola.tif
    70680e2827ad671f3732c0e1c2e1d33acb957bc0d9e3a43094783b4049225ea5 - 3intel.tif
    

    So, in answer to your question, I suggest you shell out to ImageMagick with the Python subprocess module and use its pixel-wise checksum as the key.
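As a minimal sketch of that suggestion, the snippet below wraps the same `identify -format '%#'` call shown above in a Python function. The function name `pixel_checksum` and its `identify` parameter are hypothetical, and it assumes the `identify` executable is on the PATH (on ImageMagick 7 installs the command may instead be `magick identify`).

```python
import subprocess


def pixel_checksum(path, identify="identify"):
    """Return ImageMagick's pixel-wise checksum (%#) for the image at `path`.

    `identify` is the name of the ImageMagick executable; this is an
    assumption about the local install and may need adjusting.
    """
    result = subprocess.run(
        [identify, "-format", "%#", path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output as str
        check=True,           # raise CalledProcessError if identify fails
    )
    return result.stdout.strip()
```

Calling `pixel_checksum("1.png")` and `pixel_checksum("2.png")` on the two files created earlier should then give the same string, which can serve as the key.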

  • 2020-12-31 22:14

    I'm guessing your goal is to perform image hashing in Python, which is quite different from classic hashing because the byte representation of an image depends on its format, resolution, etc.

    One image hashing technique is average hashing. Keep in mind that it is not 100% accurate, but it works well in most cases.


    First we simplify the image by reducing its size and colors. Reducing the complexity of the image greatly improves the reliability of comparisons with other images:

    Reducing size:

    img = img.resize((10, 10), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter

    Reducing colors:

    img = img.convert("L")

    Then we find the average pixel value of the image, the quantity that gives average hashing its name:

    pixel_data = list(img.getdata())
    avg_pixel = sum(pixel_data)/len(pixel_data)
    

    Finally, the hash is computed by comparing each pixel to the average pixel value: a pixel greater than or equal to the average yields 1, otherwise 0. We then convert these bits to a base-16 representation:

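    bits = "".join("1" if px >= avg_pixel else "0" for px in pixel_data)
    # Zero-pad to a fixed width (4 bits per hex digit) so hashes of different images stay aligned
    hex_representation = format(int(bits, 2), "0{}X".format(len(bits) // 4))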
    
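The thresholding and hex-conversion steps can be tried without Pillow on a plain list of grayscale values. This is only a sketch: the function name `average_hash` is hypothetical, and zero-padding the hex string to a fixed width is a choice made here so that hashes of equally sized images line up digit by digit.

```python
def average_hash(pixels):
    """Average-hash a flat sequence of grayscale values (0-255).

    Returns an uppercase hex string with one digit per 4 pixels,
    zero-padded so the length depends only on the pixel count.
    """
    if len(pixels) == 0 or len(pixels) % 4 != 0:
        raise ValueError("pixel count must be a positive multiple of 4")
    avg = sum(pixels) / len(pixels)
    # 1 where the pixel is at least as bright as the average, else 0
    bits = "".join("1" if px >= avg else "0" for px in pixels)
    return format(int(bits, 2), "0{}X".format(len(bits) // 4))


# Four dark pixels followed by four bright ones: bits are 00001111
print(average_hash([10, 10, 10, 10, 200, 200, 200, 200]))  # prints 0F
```

With a real image, `pixels` would be `list(img.getdata())` after the resize and grayscale conversion described above.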

    To compare this image with other images, apply the steps above to each one and measure the similarity between the hexadecimal representations of their average hashes. You can use something as simple as Hamming distance, or more complex measures such as Levenshtein distance, Ratcliff/Obershelp pattern recognition (SequenceMatcher), cosine similarity, etc.
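For the simplest of those options, a Hamming distance between two hex hashes can be sketched as follows. The function name `hamming_distance` is hypothetical, and the sketch assumes both hashes have the same number of hex digits (that is, they come from images reduced to the same size).

```python
def hamming_distance(hex_a, hex_b):
    """Count the bit positions where two equal-length hex hashes differ."""
    if len(hex_a) != len(hex_b):
        raise ValueError("hashes must have the same number of hex digits")
    # XOR leaves a 1 bit wherever the two hashes disagree
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


print(hamming_distance("FF", "0F"))  # prints 4
```

A distance of 0 means the two hashes are identical. What counts as "similar enough" is a threshold you would need to pick for your own images.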
