UnicodeDecodeError 'charmap' codec with Tesseract OCR in Python

▼魔方 西西 提交于 2020-06-27 18:14:39

问题


I am trying to do OCR on an image file in python using teseract-OCR. My environment is- Python 3.5 Anaconda on Windows Machine.

Here is the code:

from PIL import Image
from pytesseract import image_to_string
out = image_to_string(Image.open('sample.png'))

The error I am getting is :

File "Anaconda3\lib\sitepackages\pytesseract\pytesseract.py", line 167, in image_to_string
return f.read().strip()
File "Anaconda3\lib\encodings\cp1252.py", line 23 in decode
return codecs.charmap_decode(input, self.errors, decoding_table)[0]
UnicodeDecodeError:'charmap' codec can't decode byte 0x81 in position 1583: character maps to <undefined>

I have tried the solution mentioned here The hack is not working

I have tried my code on Mac OS it is working.

I have looked into the pytesseract issues: Here is this an open issue

Thanks


回答1:


Hmm..something very weird going on there - The character "\x81" is unprintable when we talk about the "latin1" text encoding. However, on the "cp1252" encoding the library is using, it is mapped instead to an "undefined character" - this is explicit.

What happens is that "latin1" is somewhat a "no-op" codec, used sometimes in Python to simply translate a byte sequence to an unicode string (the default string in Python 3.x). The codec "cp1252" is almost the samething, and in some contexts it is used interchangeable with latin1 - but this "\x81" code is one difference between the two. In your case, a crucial one.

The correct thing to do there is try to supply the image_to_string function with the optional lang parameter - so that it might use the correct codec to decode your text - if it recognizes better what is the character it is exposing as "0x81". However, this might not work - as it might simply be an OCR error to a very weird character not related to the language at all.

So, the workaround for you is to monkeypatch the "cp1252" codec so that instead of an error, it fills in an Unicode "unrecognized" character - one way to do that is to isnert these lines before calling tesseract:

from encodings import cp1252
original_decode  = cp1252.Codec.decode
cp1252.Codec.decode =  lambda self, input, errors="replace": original_decode(self, input, errors)

But please, if you can, open a bug report against the pytesseract project. My guess is they should be using "latin1" and not "cp1252" encoding at this point.



来源:https://stackoverflow.com/questions/38023947/unicodedecodeerror-charmap-codec-with-tesseract-ocr-in-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!