Detect table with OpenCV

无人共我 2020-12-03 03:47

I often work with scanned papers. The papers contain tables (similar to Excel tables) which I need to type into the computer manually. To make the task worse the tables can

1 answer
  • 2020-12-03 04:41

    This question seems a little old, but I was working on a similar problem and came up with my own solution, which I explain here.

    Getting good accuracy from any OCR engine involves several challenges. The main ones are:

    1. Presence of noise: Poor image quality or unwanted elements and blobs in the background region reduce accuracy. This requires some pre-processing such as noise removal, which can be done with a Gaussian filter or a median filter. Both are available in OpenCV.

    2. Wrong orientation of the image: When the image is skewed or rotated, the OCR engine fails to segment the lines and words correctly, which severely reduces accuracy.

    3. Presence of lines: During word or line segmentation, the OCR engine sometimes merges ruling lines with the words next to them, so it processes the wrong content and returns wrong results. There are other issues as well, but these are the basic ones.

    In this case I think the scanned image is clean and simple, so the following steps can be used to solve the problem.

    1. Simple image binarization removes the background content, leaving only the necessary content as shown here.
    2. Next we have to remove the lines, which in this case form the tabular grid. The grid can be identified with connected-component analysis and removed by deleting the large connected components. The final image to be fed to the OCR engine will look like this.

    3. For OCR we can use the open-source Tesseract OCR engine. I got the following results from OCR:

      Caption title

      header! header2 header3

      row1cell1 row1cell2 row1cell3

      row2cell1 row2cell2 row2cell3

    4. As we can see, the result is quite accurate, but there are some issues such as header!, which should be header1. This happens because the OCR engine confused 1 with !. Problems like this can be fixed by post-processing the result with regex-based operations.

    After post-processing, the OCR result can be parsed to read the row and column values.
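The regex clean-up and the parsing into rows and columns can be sketched in plain Python, using the OCR output quoted above as input. The function names and the correction rule are assumptions: the rule treats a "!" at the end of an alphabetic token as a misread "1", and the parser assumes that cell values contain no spaces and that the first two non-empty lines are the caption and the header.

```python
import re

OCR_TEXT = """Caption title

header! header2 header3

row1cell1 row1cell2 row1cell3

row2cell1 row2cell2 row2cell3
"""


def fix_common_confusions(text: str) -> str:
    # Assumed rule: "!" directly after a letter and at the end of a token
    # is a misread "1".
    return re.sub(r"(?<=[A-Za-z])!(?=\s|$)", "1", text)


def parse_table(text: str) -> dict:
    # Drop blank lines, then treat line 1 as caption, line 2 as header,
    # and everything after that as data rows split on whitespace.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    caption, header, *rows = lines
    return {
        "caption": caption,
        "header": header.split(),
        "rows": [row.split() for row in rows],
    }


table = parse_table(fix_common_confusions(OCR_TEXT))
print(table["header"])
print(table["rows"])
```

A correction rule like this is specific to the data at hand. A sentence that legitimately ends in "!" would be corrupted by it, so it should be restricted to fields where an exclamation mark cannot occur.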

    In this case, font information can also be used to tell the sheet title, the headings, and the normal cell values apart.
