I am creating a database from historical records which I have as photographed pages from books (100K+ pages). I wrote some Python code to do some image processing before OCR.
This is not a full solution but there is more than a comment's worth of thoughts.
You have a margin on the left, right, top and bottom of your image. If you remove that, and even cut into the text in the process, you will still have enough information to align the image. So, if you chop, say, 15% off the top, bottom, left and right, you will have reduced your image area by about half already (0.7 x 0.7 = 0.49) - which will speed things up down the line.
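As a minimal sketch of that cropping step, assuming a grayscale page held as a 2D list of pixel values (with Pillow you would use `Image.crop` on the same box instead):

```python
# Drop a fraction of the image from each edge; cutting slightly into the
# text is fine because plenty of lettering survives for alignment.
def crop_margins(pixels, frac=0.15):
    h, w = len(pixels), len(pixels[0])
    dy, dx = int(h * frac), int(w * frac)
    return [row[dx:w - dx] for row in pixels[dy:h - dy]]

page = [[255] * 100 for _ in range(200)]   # dummy 200x100 white "image"
core = crop_margins(page)
# 15% off each edge leaves 70% in each dimension (140 rows x 70 cols),
# i.e. roughly half (0.7 * 0.7 = 0.49) of the original pixels.
```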
Now take your remaining central area and divide it into, say, 10 strips, all of the same height but the full width of the page. Calculate the mean brightness of each strip and take the 1-4 darkest, as they contain the most (black) lettering. Now work on those in parallel, or just on the darkest. You are now processing just the most interesting 5-20% of the page.
Here is the command to do that in ImageMagick - it's just my weapon of choice and you can do it just as well in Python.
convert scan.jpg -crop 300x433+64+92 -crop x10@ -format "%[fx:mean]\n" info:
0.899779
0.894842
0.967889
0.919405
0.912941
0.89933
0.883133 <--- choose the 4th from last because it is darkest
0.889992
0.88894
0.888865
If I make separate images out of those 10 strips, I get this
convert scan.jpg -crop 300x433+64+92 -crop x10@ m-.jpg
and, effectively, I do the alignment on that fourth-from-last strip rather than on the whole image.
Maybe unscientific, but quite effective and pretty easy to try out.
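For completeness, here is a rough pure-Python sketch of the same strip-and-rank idea, working on a grayscale image stored as a 2D list of 0-255 pixel values (with Pillow/NumPy you would compute the strip means the same way; the dark band below is synthetic test data):

```python
# Split a page into horizontal strips, compute each strip's mean
# brightness, and keep the darkest, mirroring the ImageMagick output.
def darkest_strips(pixels, n_strips=10, keep=1):
    step = len(pixels) // n_strips
    strips = [pixels[i * step:(i + 1) * step] for i in range(n_strips)]
    means = [sum(map(sum, s)) / (len(s) * len(s[0])) for s in strips]
    order = sorted(range(n_strips), key=means.__getitem__)  # darkest first
    return [strips[i] for i in order[:keep]], means

# dummy 100x10 page: white everywhere except a black band in strip 6
page = [[255] * 10 for _ in range(100)]
for r in range(60, 70):
    page[r] = [0] * 10
chosen, means = darkest_strips(page)
```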
Another thought: once you have your procedure/script sorted out for straightening a single image, do not forget you can often get a massive speedup by using GNU Parallel
to harass all your CPU's lovely, expensive cores simultaneously. Here I specify 8 processes to run in parallel...
#!/bin/bash
# Print one command per page; GNU Parallel reads them from stdin
# and keeps 8 running at a time
for ((i=0; i<100000; i++)); do
    echo "ProcessPage $i"
done | parallel --eta -j 8
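If you would rather stay in Python, a rough counterpart using the standard library keeps 8 page jobs in flight at once; a thread pool is enough when each job shells out to an external tool, and `process_page` below is a hypothetical stand-in for your per-page pipeline:

```python
from concurrent.futures import ThreadPoolExecutor

def process_page(i):
    # placeholder: e.g. subprocess.run(["ProcessPage", str(i)])
    return i

# 8 workers, like "parallel -j 8"; map preserves page order
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_page, range(100000)))
```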