How to extract data from an image that contains tabular data?

Submitted by ≯℡__Kan透↙ on 2020-06-11 05:22:13

Question


I am using pytesseract, Pillow, and cv2 to OCR an image and extract the text in it. Since my input is a scanned PDF document, I first converted it to an image (JPEG) and then tried extracting the text. I am only halfway there: the input is a table, and the column titles are missing from the output because they sit on a black background. I also tried cv2.getStructuringElement but could not figure out how to make it work. Here is what I have done so far:

import cv2
import os  
import numpy as np 
import pytesseract
#import pillow 

# pytesseract can't read a scanned PDF directly, so convert it to JPEG with pdf2image first
from pdf2image import convert_from_path
filename = path  # path to the scanned PDF
pages = convert_from_path(filename, 500)  # render at 500 DPI
for page in pages:
    # note: every page is written to the same file, so only the last page survives
    page.save("dest", 'JPEG')


imgname = "path" 
oriimg = cv2.imread(imgname,cv2.IMREAD_COLOR) 
cv2.imshow("original image", oriimg)
cv2.waitKey(0)


#img = cv2.resize(oriimg,None,fx=0.5,fy=0.5,interpolation=cv2.INTER_CUBIC) 
img = cv2.resize(oriimg,(700,1500),interpolation=cv2.INTER_AREA) 
# the size passed to cv2.resize is (width, height)
cv2.imshow("lol", img) 
cv2.waitKey(0) 
cv2.imwrite("changed_dimensionsimgpath", img)


import PIL.Image  
image = cv2.imread(imgname,cv2.IMREAD_COLOR) 
grayedimg = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
grayedimg = cv2.threshold(grayedimg, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imwrite("H://newim.jpg", grayedimg)


pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"


text = pytesseract.image_to_string(PIL.Image.open("path"))
print(text)

My input table looks like the one below. The regions with a black background are not recognized by the OCR, so their text is not extracted. Any help would be greatly appreciated.

Output of this code for the sample image:

Sun by Select .

F'I‘L‘Mlm":[ [Juir SHIIEF'. ”fillfit Fadll'fi



Brand Type Fragranm Unit: Ithange Dollm 'LChanga Men
Eleanit' Sprayl Grange J.?IEBflI-Efl' 11% '5H'1Elfi9flflfl 35% I E
Eleanlt! kfimnsul' Grange IEEEESWI 39% I521LESM1MH 1113553 ‘ E
Dehuxe F‘mmr [emu 525.940 461% '51:EE?,GED,00 433.6% 5
Datum: Anus»! firing?) 4,3341%} 29% 513573300119 215% E
Dem Spray ‘Drangr: £432,100 09% 515.223.:53000 154%

Min Blaster Aemgul: Dramge ”2114033111 59% :SHSiMMfl H94:

DiFlEIESIEf Sprawl Drama “NEW. 50% ‘5E1D1_E-BDM 141% I
Incredlme Spray Lem 1.513.410" 483% a HELENE] $11143 I E

t“ In

1'"

Answer 1:


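Preprocessing with cv2 is fine. After saving the result with cv2.imwrite(temp_filename, gray_img), run the OCR with config='-psm 6' (spelled '--psm 6' on Tesseract 4 and later), which tells Tesseract to treat the image as a single uniform block of text:

from PIL import Image
from pytesseract import image_to_string
page_str = image_to_string(Image.open(temp_filename), lang="eng", config='-psm 6')

This should give much better results on table images.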
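
If the header row still comes out garbled, one thing that may help is turning the white-on-black header band into black-on-white before running the OCR. The following is a minimal sketch, not a tested solution for the image in the question. It assumes the dark header spans the full width of the table. The helper names invert_dark_rows and ocr_table and the threshold of 127 are made up for illustration. It uses '--psm 6', the Tesseract 4+ spelling of the flag above.

```python
import numpy as np


def invert_dark_rows(gray, dark_threshold=127):
    """Invert every pixel row whose mean intensity is below dark_threshold.

    gray is a 2-D uint8 array (a grayscale image). Rows that are mostly
    dark, such as white text on a black header band, become dark text on
    a light background. All other rows are returned unchanged.
    """
    out = gray.copy()
    dark = gray.mean(axis=1) < dark_threshold  # one boolean per pixel row
    out[dark] = 255 - out[dark]
    return out


def ocr_table(image_path, psm=6):
    """Load an image, lighten dark header bands, binarize it, and OCR it."""
    # Imported here so invert_dark_rows works without OpenCV or Tesseract.
    import cv2
    import pytesseract
    from PIL import Image

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:  # cv2.imread returns None when the file cannot be read
        raise FileNotFoundError(image_path)

    gray = invert_dark_rows(gray)
    # Otsu binarization, the same thresholding the question already uses.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # PSM 6 asks Tesseract to assume a single uniform block of text.
    return pytesseract.image_to_string(
        Image.fromarray(binary), lang="eng", config=f"--psm {psm}"
    )
```

With the image from the question saved to disk, calling print(ocr_table("path")) would print the recognized text. If only some cells in a row are dark, a row-wise test like this will not catch them, and the dark regions would have to be located per cell instead.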



Source: https://stackoverflow.com/questions/54178962/how-to-extract-data-from-image-that-contains-tabular-data
