Align text for OCR

萌比男神i 2021-01-30 10:08

I am creating a database from historical records, which I have as photographed pages from books (100K+ pages). I wrote some Python code to do some image processing before I OCR e

3 answers
  •  说谎 (original poster)
    2021-01-30 10:33

    This is not a full solution but there is more than a comment's worth of thoughts.

    You have a margin on the left, right, top and bottom of your image. If you remove those margins, even cutting into the text in the process, you will still have enough information to align the image. So, if you chop, say, 15% off each of the four sides, you keep 70% x 70% = 49% of the original area. That is roughly a 50% reduction already, which will speed things up down the line.
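
    As a rough illustration of that arithmetic, here is a minimal Python sketch. The function names and the 1000x1400 page size are assumptions made up for the example; they do not come from the scan discussed here.

```python
def central_crop_box(width, height, margin=0.15):
    """Return (left, top, right, bottom) after trimming `margin` off every edge."""
    dx = round(width * margin)
    dy = round(height * margin)
    return dx, dy, width - dx, height - dy


def remaining_area_fraction(width, height, margin=0.15):
    """Fraction of the original page area that survives the crop."""
    left, top, right, bottom = central_crop_box(width, height, margin)
    return (right - left) * (bottom - top) / (width * height)


if __name__ == "__main__":
    # Hypothetical page size, chosen only for illustration.
    print(central_crop_box(1000, 1400))         # (150, 210, 850, 1190)
    print(remaining_area_fraction(1000, 1400))  # 0.49
```

    The tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` expects, should you use Pillow for the actual cropping.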

    Now take the remaining central area and divide it into, say, 10 strips of equal height, each spanning the full width of the page. Calculate the mean brightness of each strip and take the 1-4 darkest, as they contain the most (black) lettering. Then work on each of those in parallel, or on just the darkest. You are now processing only the most interesting 5-20% of the page.

    Here is the command to do that in ImageMagick - it's just my weapon of choice and you can do it just as well in Python.

    convert scan.jpg -crop 300x433+64+92 -crop x10@ -format "%[fx:mean]\n" info:
    
    0.899779
    0.894842
    0.967889
    0.919405
    0.912941
    0.89933
    0.883133    <--- choose 4th last because it is darkest
    0.889992
    0.88894
    0.888865
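
    For anyone doing the selection step in Python instead, here is a small sketch using the ten means printed above. `darkest_strips` is a made-up helper name, and the indices are zero-based, so index 6 is the fourth-last strip.

```python
def darkest_strips(means, k=1):
    """Return the indices of the k strips with the lowest mean brightness,
    darkest first."""
    return sorted(range(len(means)), key=lambda i: means[i])[:k]


# The ten strip means from the ImageMagick output, top strip first.
means = [0.899779, 0.894842, 0.967889, 0.919405, 0.912941,
         0.89933, 0.883133, 0.889992, 0.88894, 0.888865]

if __name__ == "__main__":
    print(darkest_strips(means))       # [6]
    print(darkest_strips(means, k=4))  # [6, 9, 8, 7]
```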
    

    To write those 10 stripes out as separate images, I use:

    convert scan.jpg -crop 300x433+64+92 -crop x10@ m-.jpg
    

    In effect, I then do the alignment on the fourth-last strip image rather than on the whole page.

    Maybe unscientific, but quite effective and pretty easy to try out.
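
    If you would rather compute the strip means in Python, a dependency-free sketch might look like the following. It assumes the cropped page is already available as a list of rows of grayscale values between 0.0 (black) and 1.0 (white); `strip_means` and the tiny synthetic image are illustrative only.

```python
def strip_means(gray, n_strips=10):
    """Split a grayscale image (list of rows, values 0.0-1.0) into horizontal
    strips of near-equal height and return each strip's mean brightness.
    Assumes the image has at least n_strips rows."""
    height = len(gray)
    means = []
    for s in range(n_strips):
        top = s * height // n_strips
        bottom = (s + 1) * height // n_strips
        pixels = [v for row in gray[top:bottom] for v in row]
        means.append(sum(pixels) / len(pixels))
    return means


if __name__ == "__main__":
    # Synthetic 20x4 "page": all white except some black pixels in row 12.
    page = [[1.0] * 4 for _ in range(20)]
    page[12] = [0.0, 1.0, 0.0, 1.0]
    means = strip_means(page)
    print(means.index(min(means)))  # 6, the strip holding row 12
```

    With a real scan you would load the pixels through an imaging library and most likely let it do the averaging, but the logic stays the same: the lowest mean marks the strip with the most ink.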

    Another thought: once you have your procedure/script sorted out for straightening a single image, do not forget that you can often get a massive speedup by using GNU Parallel to harass all your CPU's lovely, expensive cores simultaneously. Here I specify 8 processes to run in parallel:

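    #!/bin/bash
    # Emit one command line per page; GNU Parallel reads them from stdin
    # and runs up to 8 at a time. ProcessPage is your own per-page script.
    for ((i=0;i<100000;i++)); do
       echo ProcessPage $i
    done | parallel --eta -j 8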
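
    If the whole pipeline lives in Python, the standard library's `concurrent.futures` can play a similar role. This is only a sketch: `run_pages` and `fake_process_page` are hypothetical names, and the real per-page function would do the straightening.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pages(process_page, page_numbers, workers=8):
    """Apply process_page to every page number using a pool of workers.
    Results come back in the same order as page_numbers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_page, page_numbers))


def fake_process_page(page_number):
    # Hypothetical stand-in: real code would straighten page `page_number`,
    # for example by launching an external command through subprocess.
    return f"page-{page_number}"


if __name__ == "__main__":
    print(run_pages(fake_process_page, range(3)))
    # ['page-0', 'page-1', 'page-2']
```

    Threads suit the case where each page is handed to an external program, since the child process does the heavy lifting. For CPU-bound work written in pure Python, `ProcessPoolExecutor` from the same module offers the same `map` interface.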
    
