I wrote a basic Python script to call the GCP Vision API. My aim is to send it an image of a product and retrieve (with OCR) the words written on the box.
From the docs:
The Vision API can detect and extract text from images. There are two annotation features that support OCR:
TEXT_DETECTION detects and extracts text from any image. For example, a photograph might contain a street sign or traffic sign. The JSON includes the entire extracted string, as well as individual words, and their bounding boxes.
DOCUMENT_TEXT_DETECTION also extracts text from an image, but the response is optimized for dense text and documents. The JSON includes page, block, paragraph, word, and break information.
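To make the difference concrete, here's a sketch of the REST request body for the `images:annotate` endpoint; the only thing that changes between the two modes is the feature type. The image content would be your base64-encoded bytes (a placeholder string is used here):

```python
import json

def build_request(feature_type, image_b64):
    """Build an images:annotate request body for the given OCR feature."""
    return {
        "requests": [
            {
                "image": {"content": image_b64},
                "features": [{"type": feature_type}],
            }
        ]
    }

sparse = build_request("TEXT_DETECTION", "BASE64_IMAGE_DATA")
dense = build_request("DOCUMENT_TEXT_DETECTION", "BASE64_IMAGE_DATA")
print(json.dumps(dense, indent=2))
```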
My hope was that the web API was actually using the latter, and then filtering the results based on the confidence.
A DOCUMENT_TEXT_DETECTION response includes additional layout information, such as page, block, paragraph, word, and break information, along with confidence scores for each.
At any rate, I was hoping (and my experience has been) that the latter method would "try harder" to find all the strings.
I don't think you were doing anything "wrong". There are just two parallel detection methods. One (DOCUMENT_TEXT_DETECTION) is more intensive and optimized for documents (likely for straightened, aligned, and evenly spaced lines of text), and it returns extra layout information that may be unnecessary for some applications.
So I'd suggest modifying your code to follow the Python example here.
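As a sketch of what that looks like: a DOCUMENT_TEXT_DETECTION response is a tree of pages, blocks, paragraphs, words, and symbols, each with a confidence score. The filtering function below walks that tree; it operates on plain dicts so it can be shown standalone, but the same traversal applies to the real response objects (via attributes instead of keys). The threshold value is just an illustration:

```python
def high_confidence_words(pages, min_conf=0.8):
    """Collect words from a DOCUMENT_TEXT_DETECTION layout tree whose
    per-word confidence meets the threshold."""
    words = []
    for page in pages:
        for block in page["blocks"]:
            for para in block["paragraphs"]:
                for word in para["words"]:
                    text = "".join(s["text"] for s in word["symbols"])
                    if word["confidence"] >= min_conf:
                        words.append(text)
    return words

# With the real client library (network call, credentials required):
# from google.cloud import vision
# client = vision.ImageAnnotatorClient()
# response = client.document_text_detection(image=vision.Image(content=img_bytes))
# ...then walk response.full_text_annotation.pages the same way,
# using .blocks, .paragraphs, .words, .symbols, and .confidence attributes.
```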
Lastly, my guess is that the \342\204\242 you ask about are escaped octal values for the three UTF-8 bytes produced when it identified the ™ symbol.
If you use the following snippet:
b = b"\342\204\242"   # the three octal-escaped bytes
s = b.decode('utf8')  # interpret them as a UTF-8 sequence
print(s)
You'll be happy to see that it prints ™.
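If the escapes show up as literal text in logged output (a backslash followed by digits, rather than actual bytes), a small round-trip recovers the character. This is a general Python sketch, not specific to the Vision API:

```python
raw = r"\342\204\242"                     # literal backslash-escape text, as seen in a log
decoded = (raw.encode("ascii")            # bytes containing the literal backslashes
              .decode("unicode_escape")   # interpret \342 etc. as code points 0xE2 0x84 0xA2
              .encode("latin-1")          # turn those code points back into raw bytes
              .decode("utf-8"))           # decode the bytes as UTF-8
print(decoded)  # ™
```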