Which special characters are identified as individual words in Google Vision OCR?

Submitted by 做~自己de王妃 on 2019-12-13 02:15:27

Question


I was trying to make Google Vision OCR output searchable with regular expressions. I have completed it, and it works well when the document contains only English characters, but it fails when the document contains text in other languages.

This happens because my Google Vision word-component patterns contain only English characters, as follows.

VISION_API_WORD_COUNTERS = "([a-zA-Z0-9]+)|([^a-zA-Z0-9 ])";
VISION_API_WORD_COMPONENTS = "[a-zA-Z0-9]";
VISION_API_NOT_WORD_COMPONENTS = "[^a-zA-Z0-9]";
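To make the failure concrete, here is a minimal Java sketch that applies the same expression as VISION_API_WORD_COUNTERS to a few strings. The class name AsciiWordCounterDemo and the helper tokens are hypothetical, introduced only for illustration, and Java is an assumption about the language in use. It shows how my own patterns split text, not how Google Vision itself segments words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the Vision API.
final class AsciiWordCounterDemo {
    // Same expression as VISION_API_WORD_COUNTERS above.
    static final Pattern WORD_COUNTERS =
            Pattern.compile("([a-zA-Z0-9]+)|([^a-zA-Z0-9 ])");

    // Returns every token the pattern finds, in order of appearance.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD_COUNTERS.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // ASCII text: a run of letters or digits is one token,
        // each punctuation mark is its own token, spaces are skipped.
        System.out.println(tokens("Total: 42"));

        // "caf\u00e9": the accented letter is not in [a-zA-Z0-9],
        // so it is split off as a separate one-character token.
        System.out.println(tokens("caf\u00e9").size());

        // A three-letter Cyrillic word becomes three separate tokens.
        System.out.println(tokens("\u043c\u0438\u0440").size());
    }
}
```

Every non-ASCII letter falls into the second alternative, so a word in another script is counted as many one-character "words".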

As I can't include characters from every language, I am thinking of using the inverse of the above instead. Something like

VISION_API_WORD_COMPONENTS = "[^*ALL THE SPECIAL CHARACTERS WHICH ARE IDENTIFIED AS WORD BY GOOGLE VISION*]"

for example [^!@#$%^&*()_+=].
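One possible alternative to enumerating characters, sketched here in Java under assumptions, is to describe word components by Unicode category rather than by an ASCII range. The class UnicodeWordComponents and its constants are hypothetical counterparts of the three patterns above. The choice of \p{L} (letters), \p{M} (combining marks) and \p{N} (numbers) is my assumption about what should count as a word component, and it is not confirmed to match how Google Vision splits words. In particular, a run of characters in a script written without spaces would be treated as a single token here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the category choice is an assumption,
// not documented Google Vision behavior.
final class UnicodeWordComponents {
    // Letters, combining marks and numbers from any script.
    static final String WORD_COMPONENTS = "[\\p{L}\\p{M}\\p{N}]";
    // Anything that is not a letter, mark or number.
    static final String NOT_WORD_COMPONENTS = "[^\\p{L}\\p{M}\\p{N}]";
    // A run of word components, or one non-space symbol on its own.
    static final String WORD_COUNTERS =
            "([\\p{L}\\p{M}\\p{N}]+)|([^\\p{L}\\p{M}\\p{N} ])";

    private static final Pattern COUNTER = Pattern.compile(WORD_COUNTERS);

    // Returns every token the Unicode-aware pattern finds, in order.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = COUNTER.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Accented Latin and Cyrillic words stay whole;
        // the exclamation mark is still a separate token.
        System.out.println(tokens("caf\u00e9 \u043c\u0438\u0440!").size());
    }
}
```

With this sketch the special-character list is never written out: whatever is not a letter, mark, number or space is treated as a separate token. Whether that agrees with the words Vision actually returns would still need to be checked against real OCR output.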

So where can I find ALL THE SPECIAL CHARACTERS WHICH ARE IDENTIFIED AS A SEPARATE WORD BY GOOGLE VISION?

Trial and error, adding each special character as I find it, is one option, but that would be my last resort.

Source: https://stackoverflow.com/questions/52829583/special-characters-which-are-identified-as-individual-word-in-google-vision-ocr
