Why does the GCP Vision API return worse results in Python than in its online demo?


Question


I wrote a basic Python script to call the GCP Vision API. My aim is to send it an image of a product and to retrieve (with OCR) the words written on the box. I have a predefined list of brands, so I can search the text returned by the API for a brand name and detect which product it is.
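As a sketch, that matching step looks roughly like this (the BRANDS list and the find_brand helper here are hypothetical, shortened stand-ins for my real list):

# Hypothetical, shortened brand list for illustration.
BRANDS = ["Acuvue", "Biofinity", "Dailies"]

def find_brand(ocr_text):
    """Return the first known brand found in the OCR text, or None."""
    lowered = ocr_text.lower()
    for brand in BRANDS:
        if brand.lower() in lowered:
            return brand
    return None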

My Python script is the following:

import io
import os

import cv2
from google.cloud import vision
from google.cloud.vision import types

# Path to the service-account key file (redacted).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "**************************"


def detect_text(file):
    """Detects text in the file."""
    client = vision.ImageAnnotatorClient()

    # Read the raw image bytes and wrap them in a Vision API Image message.
    with io.open(file, 'rb') as image_file:
        content = image_file.read()

    image = types.Image(content=content)

    # Issue a TEXT_DETECTION request and iterate over the annotations.
    response = client.text_detection(image=image)
    texts = response.text_annotations
    print('Texts:')

    for text in texts:
        print('\n"{}"'.format(text.description))

        vertices = (['({},{})'.format(vertex.x, vertex.y)
                     for vertex in text.bounding_poly.vertices])

        print('bounds: {}'.format(','.join(vertices)))


file_name = "Image.jpg"
img = cv2.imread(file_name)  # loaded with OpenCV but not used below

detect_text(file_name)

For now, I am experimenting with the following product image (951 × 335 resolution):

Its brand is Acuvue.

The problem is the following. When I test this image with the online demo of the GCP Cloud Vision API, I get the following text result:

FOR ASTIGMATISM 1-DAY ACUVUE MOIST WITH LACREON™ 30 Lenses BRAND CONTACT LENSES UV BLOCKING

(The JSON result contains all of the above words, including the word Acuvue, which is the one that matters to me, but it is too long to post here.)

So the online demo detects the text on the product quite well, and in particular it accurately detects the word Acuvue (the brand). However, when I call the same API from my Python script with the same image, I get the following result:

Texts:

"1.DAY
FOR ASTIGMATISM
WITH
LACREONTM
MOIS
30 Lenses
BRAND CONTACT LENSES
UV BLOCKING
"
bounds: (221,101),(887,101),(887,284),(221,284)

"1.DAY"
bounds: (221,101),(312,101),(312,125),(221,125)

"FOR"
bounds: (622,107),(657,107),(657,119),(622,119)

"ASTIGMATISM"
bounds: (664,107),(788,107),(788,119),(664,119)

"WITH"
bounds: (614,136),(647,136),(647,145),(614,145)

"LACREONTM"
bounds: (600,151),(711,146),(712,161),(601,166)

"MOIS"
bounds: (378,162),(525,153),(528,200),(381,209)

"30"
bounds: (614,177),(629,178),(629,188),(614,187)

"Lenses"
bounds: (634,178),(677,180),(677,189),(634,187)

"BRAND"
bounds: (361,210),(418,210),(418,218),(361,218)

"CONTACT"
bounds: (427,209),(505,209),(505,218),(427,218)

"LENSES"
bounds: (514,209),(576,209),(576,218),(514,218)

"UV"
bounds: (805,274),(823,274),(823,284),(805,284)

"BLOCKING"
bounds: (827,276),(887,276),(887,284),(827,284)

But, unlike the demo, this does not detect the word "Acuvue" at all!

Why is this happening?

Can I fix something in my Python script to make it work properly?


Answer 1:


From the docs:

The Vision API can detect and extract text from images. There are two annotation features that support OCR:

  • TEXT_DETECTION detects and extracts text from any image. For example, a photograph might contain a street sign or traffic sign. The JSON includes the entire extracted string, as well as individual words, and their bounding boxes.

  • DOCUMENT_TEXT_DETECTION also extracts text from an image, but the response is optimized for dense text and documents. The JSON includes page, block, paragraph, word, and break information.

My hope was that the web API was actually using the latter and then filtering the results based on confidence.

A DOCUMENT_TEXT_DETECTION response includes additional layout information, such as page, block, paragraph, word, and break information, along with confidence scores for each.

At any rate, I was hoping (and my experience has been) that the latter method would "try harder" to find all the strings.

I don't think you were doing anything "wrong". There are just two parallel detection methods. One (DOCUMENT_TEXT_DETECTION) is more intensive, optimized for documents (likely for straight, aligned, and evenly spaced lines), and returns more information that may be unnecessary for some applications.

So I suggest you modify your code along the lines of the DOCUMENT_TEXT_DETECTION Python example in the Cloud Vision docs; a rough sketch follows.
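Here is a minimal sketch of that change, reusing the same old-style types import as your script; the field names follow the published full_text_annotation structure, but treat this as a sketch, not a drop-in solution:

import io
from google.cloud import vision
from google.cloud.vision import types

def detect_document_text(path):
    """OCR with DOCUMENT_TEXT_DETECTION, which is optimized for dense text."""
    client = vision.ImageAnnotatorClient()

    with io.open(path, 'rb') as image_file:
        content = image_file.read()

    image = types.Image(content=content)

    # document_text_detection sends a DOCUMENT_TEXT_DETECTION request
    # instead of the plain TEXT_DETECTION your script uses.
    response = client.document_text_detection(image=image)

    # full_text_annotation exposes the page/block/paragraph/word hierarchy,
    # with a confidence score at each level.
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    word_text = ''.join(s.text for s in word.symbols)
                    print('{} (confidence: {:.2f})'.format(
                        word_text, word.confidence))

detect_document_text("Image.jpg")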

Lastly, my guess is that the \342\204\242 you asked about are escaped octal values for the UTF-8 bytes it produced when trying to identify the ™ symbol.

If you use the following snippet:

# Three octal escapes -> the three UTF-8 bytes 0xE2 0x84 0xA2.
b = b"\342\204\242"
s = b.decode('utf8')
print(s)

You'll be happy to see that it prints ™.
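Going the other direction confirms the mapping; Python prints the same three bytes in hex (0xE2 0x84 0xA2 is octal \342\204\242):

# The trademark sign encodes to three UTF-8 bytes.
print('™'.encode('utf8'))  # prints b'\xe2\x84\xa2'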



Source: https://stackoverflow.com/questions/50117524/why-gcp-vision-api-returns-worse-results-in-python-than-at-its-online-demo
