Extract lined table from scanned document opencv python

Submitted by 冷暖自知 on 2020-02-23 02:54:30

Question


I want to extract the information from a scanned table and store it as a CSV. Right now my table extraction algorithm performs the following steps:

  1. Apply skew correction.
  2. Apply a Gaussian filter for denoising.
  3. Binarize using Otsu thresholding.
  4. Apply a morphological opening.
  5. Run Canny edge detection.
  6. Apply a Hough transform to obtain the lines of the table.
  7. Remove duplicate lines (lines that fall within 10 pixels of each other).
  8. Filter the horizontal and vertical lines by slope, keeping only lines within +/-5 degrees of horizontal or of vertical.
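Steps 7 and 8 can be sketched in plain Python as follows. This is only an illustration, not the asker's actual code: the function names `classify_segments` and `merge_close`, the `(x1, y1, x2, y2)` segment format, and the choice to merge duplicates by averaging their positions are all assumptions.

```python
import math


def classify_segments(segments, max_dev_deg=5.0):
    """Split (x1, y1, x2, y2) segments into near-horizontal and near-vertical.

    Segments deviating more than max_dev_deg from both axes are dropped.
    """
    horizontal, vertical = [], []
    for seg in segments:
        x1, y1, x2, y2 = seg
        # Undirected angle of the segment, folded into [0, 180) degrees.
        angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1))) % 180.0
        if angle <= max_dev_deg or angle >= 180.0 - max_dev_deg:
            horizontal.append(seg)
        elif abs(angle - 90.0) <= max_dev_deg:
            vertical.append(seg)
    return horizontal, vertical


def merge_close(positions, tol=10):
    """Merge line positions (y for horizontals, x for verticals) within tol pixels.

    Each group of nearby positions is replaced by its mean.
    """
    groups = []
    for p in sorted(positions):
        if groups and p - groups[-1][-1] <= tol:
            groups[-1].append(p)
        else:
            groups.append([p])
    return [sum(g) / len(g) for g in groups]
```

Note that `merge_close` chains neighbours, so a long run of lines each less than `tol` apart collapses into one group; a stricter rule may suit dense tables better.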

This algorithm works fine for born-digital PDFs and most scanned documents. However, some documents contain a noisy table, and in those cases the lines are not identified correctly.

Here is a sample image in which my algorithm fails.

These are the operations I am applying to this table:

  1. Gaussian blur
  2. Otsu thresholding
  3. Morphological opening
  4. Canny edge detection
  5. Filtered lines. As you can see, the lines are clearly not identified correctly.

Can anyone suggest a better method for extracting horizontal and vertical lines from low-quality scans like this one?

Thanks in advance!!


Answer 1:


The problem is, and always will be, that you don't have perfect lines. One possible approach is:

  • Threshold the grayscale image, as you have done.
  • Find the largest contour in the image, which will be your table.
  • Use flood fill to separate the table from the rest of the image, choosing any point on that contour as the seed to create a flooded mask.



Answer 2:


I found a perfect solution in this blog. https://medium.com/coinmonks/a-box-detection-algorithm-for-any-image-containing-boxes-756c15d7ed26

Here, we apply morphological transformations with a vertical kernel to detect vertical lines and with a horizontal kernel to detect horizontal lines, and then combine the two results to get all the required lines.

(Images in the original post: vertical lines, horizontal lines, and the required output.)




Answer 3:


The problem might be in the standard Hough transform, cv2.HoughLines().

You can try the probabilistic variant instead: cv2.HoughLinesP()

For HoughLines() to work well, the lines need to be close to perfect. The image you provided is visibly distorted, which is likely what causes the method to fail.

Try dilating your image first.



Source: https://stackoverflow.com/questions/55276042/extract-lined-table-from-scanned-document-opencv-python
