Align text for OCR

萌比男神i 2021-01-30 10:08

I am creating a database from historical records, which I have as photographed pages from books (100K+ pages). I wrote some Python code to do some image processing before I OCR e

3 Answers
  • 2021-01-30 10:20

    "align the text in the image" I suppose means to deskew the image so that text lines have the same baseline.

    I thoroughly enjoyed reading the scientific answers to this rather overengineered task. The answers are great, but is it really necessary to spend so much time (a very precious resource) implementing this? Plenty of tools already do this without a single line of code (unless the OP is a CS student who wants to practice the science, but the OP is evidently doing this out of necessity, to get all the images processed). These methods took me back to my college years, but today I would use different tools to process this batch quickly and efficiently, as I do daily: I work for a high-volume document conversion and data extraction service bureau and OCR consulting company.

    Here is the result of a basic open-and-deskew step in the ABBYY FineReader commercial desktop OCR package. Deskewing alone was more than sufficient for further OCR processing.

    And I did not need to recreate and program my own browser just to post this answer.

  • 2021-01-30 10:33

    This is not a full solution but there is more than a comment's worth of thoughts.

    Your image has a margin on all four sides. If you remove it, even cutting into the text in the process, you will still have enough information to align the image. So, if you chop, say, 15% off each of the top, bottom, left and right, you will already have reduced the image area by about half (0.70 × 0.70 ≈ 0.49 of the original remains), which will speed things up down the line.

    Now take the remaining central area and divide it into, say, 10 strips, all of the same height but the full width of the page. Calculate the mean brightness of each strip and take the 1-4 darkest, as they contain the most (black) lettering. Then work on each of those in parallel, or on just the darkest. You are now processing only the most interesting 5-20% of the page.

    Here is the command to do that in ImageMagick - it's just my weapon of choice and you can do it just as well in Python.

    convert scan.jpg -crop 300x433+64+92 -crop x10@ -format "%[fx:mean]\n" info:
    
    0.899779
    0.894842
    0.967889
    0.919405
    0.912941
    0.89933
    0.883133    <--- choose 4th last because it is darkest
    0.889992
    0.88894
    0.888865
    

    To write those 10 strips out as separate images:

    convert scan.jpg -crop 300x433+64+92 -crop x10@ m-.jpg
    

    In effect, I do the alignment on the fourth-from-last strip rather than on the whole image.

    Maybe unscientific, but quite effective and pretty easy to try out.
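
    The same strip selection can be sketched in Python with NumPy. This is only an illustration of the idea above, not a port of the ImageMagick command: the function name `darkest_strips` and its parameters are made up for this example, and it assumes the page is already loaded as a 2-D grayscale array in which larger values are brighter.

```python
import numpy as np


def darkest_strips(gray, margin=0.15, n_strips=10, keep=1):
    """Crop the margins, cut the core into horizontal strips, return the darkest.

    `gray` is assumed to be a 2-D grayscale array (larger value = brighter).
    Returns a list of (strip_index, strip_pixels) pairs and the strip means.
    """
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    dy, dx = int(round(h * margin)), int(round(w * margin))
    core = gray[dy:h - dy, dx:w - dx]           # drop the margin on every side

    # Near-equal horizontal strips, each spanning the full width of the core
    strips = np.array_split(core, n_strips, axis=0)
    means = np.array([s.mean() for s in strips])

    # Lowest mean brightness = most ink, so sort ascending and keep the first few
    order = np.argsort(means)[:keep]
    return [(int(i), strips[i]) for i in order], means
```

    The returned strips can then be passed to whatever alignment routine is used on the full page; `keep` can be raised to 2-4 to work on several strips, as suggested above.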

    Another thought: once you have your procedure/script sorted out for straightening a single image, do not forget that you can often get a massive speedup by using GNU Parallel to harass all your CPU's lovely, expensive cores simultaneously. Here I specify 8 processes to run in parallel...

    #!/bin/bash
    # Print one command line per page; GNU Parallel reads them and runs up to 8 at a time
    for ((i=0;i<100000;i++)); do
       echo ProcessPage $i
    done | parallel --eta -j 8
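
    If the per-page step is already a Python function, roughly the same fan-out can be sketched with the standard library instead of GNU Parallel. `process_page` below is a hypothetical stand-in for the real straightening routine, and `run_pages` and its parameters are names invented for this sketch.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def process_page(page_number):
    """Hypothetical stand-in for the real per-page straightening routine."""
    # Replace this body with the actual load / deskew / save steps.
    return f"page {page_number} done"


def run_pages(page_numbers, workers=8, executor_cls=ProcessPoolExecutor):
    """Apply process_page to every page using `workers` parallel workers.

    ThreadPoolExecutor can be passed as executor_cls for a quick smoke test.
    """
    with executor_cls(max_workers=workers) as pool:
        # chunksize batches the tasks so inter-process overhead stays small
        return list(pool.map(process_page, page_numbers, chunksize=64))
```

    Calling `run_pages(range(100000), workers=8)` would mirror the `-j 8` setting above. On platforms that start worker processes by spawning, the call needs to sit under an `if __name__ == "__main__":` guard.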
    
  • 2021-01-30 10:36

    Preface: I haven't done much image processing with Python. I can give you an image processing suggestion, but you'll have to implement it in Python yourself. All you need is an FFT and a polar transformation (I think OpenCV has a built-in function for that), so that should be straightforward.

    You have only posted one sample image, so I don't know if this works as well for other images, but for this image, a Fourier transform can be very useful: Simply pad the image to a nice power of two (e.g. 2048x2048) and you get a Fourier spectrum like this:

    I've posted an intuitive explanation of the Fourier transform here, but in short: your image can be represented as a series of sine/cosine waves, and most of those "waves" are parallel or perpendicular to the document orientation. That's why you see a strong frequency response at roughly 0°, 90°, 180° and 270°. To measure the exact angle, you could take a polar transform of the Fourier spectrum:

    and simply take the columnwise mean:

    The peak position in that diagram is at 90.835°, and if I rotate the image by -(90.835 modulo 90), i.e. by -0.835°, the orientation looks decent:

    Like I said, I don't have more test images, but it works for rotated versions of your image. At the very least it should narrow down the search space for a more expensive search method.

    Note 1: The FFT is fast, but it obviously takes more time for larger images. And sadly the best way to get a better angle resolution is to use a larger input image (i.e. with more white padding around the source image.)

    Note 2: the FFT actually returns an image where the "DC" component (the center of the spectrum image above) is at the origin (0, 0). But the rotation property is clearer if you shift it to the center, and it makes the polar transform easier, so I just showed the shifted version.
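
    A minimal NumPy-only sketch of the procedure described in this answer follows. It is an illustration under stated assumptions, not the code used to produce the figures: the name `estimate_skew_degrees` is made up, the image is zero-padded after mean subtraction and a Hann window (rather than padded with white), and instead of an explicit polar transform it samples the shifted log-spectrum along rays and averages each ray, which plays the role of the columnwise mean. The sign convention (positive when text lines slope downward to the right in array coordinates) should be checked against a known image.

```python
import numpy as np


def estimate_skew_degrees(gray, size=256, n_angles=1800):
    """Estimate page skew in degrees, folded into [-45, 45), from the FFT."""
    gray = np.asarray(gray, dtype=float)
    h, w = gray.shape
    if h > size or w > size:
        raise ValueError("image must fit inside the padded size")

    # Assumption of this sketch: remove the mean and taper the edges so the
    # image frame itself does not add strong 0 / 90 degree responses.
    window = np.hanning(h)[:, None] * np.hanning(w)[None, :]
    canvas = np.zeros((size, size))
    top, left = (size - h) // 2, (size - w) // 2
    canvas[top:top + h, left:left + w] = (gray - gray.mean()) * window

    # Shift DC to the centre (see Note 2) and compress the dynamic range
    spectrum = np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(canvas))))

    # Sample the spectrum along rays from the centre: a polar transform
    centre = size // 2
    radii = np.arange(size // 32, size // 2)          # skip the bins near DC
    angles = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    theta = np.deg2rad(angles)
    cols = np.rint(centre + np.outer(np.cos(theta), radii)).astype(int)
    rows = np.rint(centre + np.outer(np.sin(theta), radii)).astype(int)

    # Mean response per angle, then the peak angle folded modulo 90
    profile = spectrum[rows, cols].mean(axis=1)
    peak = angles[np.argmax(profile)]
    return (peak + 45.0) % 90.0 - 45.0
```

    Rotating the page by the negative of the returned value should bring the text lines back to horizontal, under the sign convention assumed above. As Note 1 says, a larger `size` gives finer angular resolution at the cost of a slower FFT.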
