Question
I'm trying to create training data for Tesseract 4.0 to identify icons (like, comment, share, save) in screenshots. This is a sample screenshot:
I would like to fine-tune Tesseract to produce output like the following:
Like 147
Comment 29
Saved 5
Actions
58
Actions
Profile Visits 24
Follows 2
I followed the steps described in https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/
I modified the box file as follows:
- Heart : Like
- Speech bubble: Comment
- Bookmark: Saved
- Arrow: Share
However, the resulting training data failed to read the icons as I wanted. One of the errors I got was 'Like is not in unicharset'. Do I have to do something different when creating the unicharset for icons?
Answer 1:
I've figured it out. The box editor expects a single letter or number per box, not a full word. I used a Unicode character to represent each icon. The steps are as follows:
- Crop all the target icons that you want Tesseract to detect and save them in one file named (in my case) own.std.exp0.png
- Create the box file using the command 'tesseract own.std.exp0.png own.std.exp0 makebox'
- Open the box file in jTessBoxEditor and enter a Unicode character in the Char column for each icon. Supported Unicode characters can be looked up in the Character Map program (https://sites.psu.edu/symbolcodes/windows/charmap/). Example: for the heart symbol I used U+2665. Note that some Unicode characters are not supported and show up as a blank square, so keep trying until you find one that works. My final edited box file looks like this.
- Create the final training file, own.traineddata (this can be done as shown at https://medium.com/apegroup-texts/training-tesseract-for-labels-receipts-and-such-690f452e8f79 or by training with jTessBoxEditor).
- Copy own.traineddata to the Tesseract/tessdata directory and run Tesseract using lang='own+eng'. I used pytesseract and the output is as below:
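
For anyone who wants to script the steps above, here is a minimal Python sketch, not the exact code I ran. It assumes the box file uses the usual one-glyph-per-line layout ("glyph left bottom right top page"), and that the trained model returns the Unicode symbols, which then have to be mapped back to words such as "Like". Only U+2665 for the heart comes from my steps; the other three code points and all function names are hypothetical placeholders.

```python
# Hypothetical mapping: only U+2665 (heart) comes from the answer above.
# The other code points are placeholders; use whichever ones render for you.
ICON_LABELS = {
    "\u2665": "Like",     # heart
    "\u263A": "Comment",  # placeholder for the speech bubble
    "\u2605": "Saved",    # placeholder for the bookmark
    "\u25BA": "Share",    # placeholder for the arrow
}


def relabel_box_lines(box_lines, symbols):
    """Replace the glyph field of each box line with one Unicode symbol.

    Each box line is assumed to look like
    "<glyph> <left> <bottom> <right> <top> <page>".
    """
    if len(box_lines) != len(symbols):
        raise ValueError("need exactly one symbol per box line")
    relabeled = []
    for line, symbol in zip(box_lines, symbols):
        # A whole word such as "Like" is what caused the unicharset error.
        if len(symbol) != 1:
            raise ValueError(f"label must be a single character: {symbol!r}")
        # Split from the right so the five numeric fields stay intact.
        _, *coords = line.rstrip("\n").rsplit(" ", 5)
        relabeled.append(" ".join([symbol, *coords]))
    return relabeled


def icons_to_labels(text):
    """Swap each recognised icon symbol for its human-readable label."""
    for symbol, label in ICON_LABELS.items():
        text = text.replace(symbol, label)
    return text


def read_screenshot(path):
    """Run OCR with the custom model plus English, then map icons to words."""
    # Imported here so the helpers above work without Tesseract installed.
    from PIL import Image
    import pytesseract

    raw = pytesseract.image_to_string(Image.open(path), lang="own+eng")
    return icons_to_labels(raw)
```

read_screenshot needs own.traineddata in the tessdata directory, as in the last step. The two helper functions are plain string handling and can be tried on their own, for example icons_to_labels("\u2665 147").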
Source: https://stackoverflow.com/questions/57995023/train-tesseract-to-label-icons