python-tesseract

Tesseract OCR acts weird when scaling up image size. How to know which scale factor is best for a particular type of image?

混江龙づ霸主 submitted on 2020-06-27 18:35:11
Question: I have this 006.jpg image and I tried the following Python code. I downloaded "eng" from tessdata_best and renamed it to "eng_best".
    img = cv2.imread(file_path)
    lang = "eng_best"
    for img_scale_factor in range(1, 8):
        print(file_path, img_scale_factor)
        img = cv2.resize(img, None, fx=img_scale_factor, fy=img_scale_factor)
        hocr_data = pytesseract.image_to_pdf_or_hocr(img, extension="hocr", lang=lang, config="--dpi 1")
        file_name = '{0:03d}_jpg_{1}_x{3}.{2}'.format(6, lang, "hocr", img_scale_factor)
        with
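One thing worth checking in the loop above: `img` is reassigned on every iteration, so each `cv2.resize` call appears to scale the already-enlarged image, not the original. If that reading of the excerpt is right, the effective factors compound instead of following 1 through 7. The sketch below uses plain arithmetic, with no OpenCV call, to show the difference; `effective_scales` is a hypothetical helper name, not part of any library.

```python
def effective_scales(factors, reuse_result):
    """Return the scale relative to the original image at each step.

    reuse_result=True mimics `img = cv2.resize(img, ...)` inside the loop;
    reuse_result=False mimics resizing the untouched original each time.
    """
    scales = []
    current = 1
    for f in factors:
        if reuse_result:
            current *= f  # each resize starts from the previous result
            scales.append(current)
        else:
            scales.append(f)  # each resize starts from the original
    return scales


factors = range(1, 8)
compounded = effective_scales(factors, reuse_result=True)
independent = effective_scales(factors, reuse_result=False)
print(compounded)
print(independent)
```

Writing the resized image to a separate variable, for example `scaled = cv2.resize(img, None, fx=f, fy=f)`, would keep each iteration independent of the previous one.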

UnicodeDecodeError 'charmap' codec with Tesseract OCR in Python

喜你入骨 submitted on 2020-06-27 18:18:12
Question: I am trying to do OCR on an image file in Python using Tesseract-OCR. My environment is Python 3.5 (Anaconda) on a Windows machine. Here is the code:
    from PIL import Image
    from pytesseract import image_to_string
    out = image_to_string(Image.open('sample.png'))
The error I am getting is:
    File "Anaconda3\lib\sitepackages\pytesseract\pytesseract.py", line 167, in image_to_string
        return f.read().strip()
    File "Anaconda3\lib\encodings\cp1252.py", line 23 in decode
        return codecs.charmap_decode(input,
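The traceback ends in `cp1252.py`, which suggests the OCR output was being decoded with the Windows default code page instead of UTF-8. The sketch below reproduces that class of error with the standard library alone. The sample text is made up, and whether this is the exact cause in the asker's setup is an assumption.

```python
# Tesseract writes its text output as UTF-8. A right double quotation mark
# (U+201D) is used here as an example of a non-ASCII character.
raw = "sample \u201d".encode("utf-8")  # UTF-8 bytes end in 0x9d

error_message = ""
try:
    raw.decode("cp1252")  # 0x9d has no mapping in Python's cp1252 codec
    decoded_with_cp1252 = True
except UnicodeDecodeError as exc:
    decoded_with_cp1252 = False
    error_message = str(exc)

text = raw.decode("utf-8")  # decoding with the matching codec succeeds
print(decoded_with_cp1252, error_message)
print(text)
```

Opening the output file with an explicit `encoding="utf-8"` is the usual direction for a fix when the platform default is cp1252.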

Captcha preprocessing and solving with OpenCV and pytesseract

时间秒杀一切 submitted on 2020-06-24 14:17:45
Question: Problem: I am trying to write code in Python for image preprocessing and recognition using Tesseract-OCR. My goal is to solve this form of captcha reliably. (Image: original captcha and result of each preprocessing step.) Steps as of now:
    1. Greyscale and thresholding of the image
    2. Image enhancing with PIL
    3. Convert to TIF and scale to >300px
    4. Feed it to Tesseract-OCR (whitelisting all uppercase letters)
However, I still get a rather incorrect reading (EPQ M Q). What other preprocessing steps can I take to
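The whitelisting step listed above can be expressed as a Tesseract config string. This is a minimal sketch, assuming the string is passed through pytesseract's `config` argument. `build_whitelist_config` is a hypothetical helper, and page segmentation mode 7 (treat the image as a single text line) is an assumption about the captcha layout, not something the question states.

```python
import string


def build_whitelist_config(chars=string.ascii_uppercase, psm=7):
    """Build a Tesseract config string limited to the given characters."""
    return f"--psm {psm} -c tessedit_char_whitelist={chars}"


config = build_whitelist_config()
print(config)
# Hypothetical usage: pytesseract.image_to_string(image, config=config)
```

Whitelist support has varied across Tesseract versions and engine modes, so it is worth confirming that the installed version honors the setting before relying on it.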

How to extract data from image that contains tabular data?

可紊 submitted on 2020-06-11 05:22:32
Question: I am using pytesseract, Pillow, and cv2 to OCR an image and get the text present in the image. Since my input is a scanned PDF document, I first converted it into an image (JPEG) and then tried extracting the text. I am only halfway there. The input is a table, and the titles are not being recognized because they have a black background. I also tried getStructuringElement but was unable to figure out a way. Here is what I have done until now:
    import cv2
    import os
    import numpy as np
    import

Recognize specific numbers from table image with Pytesseract OCR

試著忘記壹切 submitted on 2020-05-15 05:13:12
Question: I want to read a column of numbers from an attached image (PNG file). My code is:
    import cv2
    import pytesseract
    import os
    img = cv2.imread(os.path.join(image_path, image_name), 0)
    config = "-c tessedit_char_whitelist=01234567890.:ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    pytesseract.image_to_string(img, config=config)
This code gives me the output string: 'n113\nun\n1.08'. As we can see, there are two problems: It fails to recognize a decimal point in 1.13 (see attached picture). It
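To inspect what came back, the output string can be split into numeric tokens. This is a small sketch using only `re`. The input is the string quoted in the question, taken literally from the excerpt, and `numeric_tokens` is a hypothetical helper.

```python
import re


def numeric_tokens(ocr_text):
    """Return integer or decimal tokens found in OCR output, in order."""
    return re.findall(r"\d+(?:\.\d+)?", ocr_text)


ocr_text = "n113\nun\n1.08"  # output string as quoted in the question
tokens = numeric_tokens(ocr_text)
print(tokens)
```

A token without a decimal point where one is expected, such as `113` here, flags a row that needs another look.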
