Train Tesseract to label icons

囚心锁ツ 2021-02-11 09:52

I'm trying to create training data for Tesseract 4.0 so that it can identify icons (like, comment, share, save) in screenshots. This is a sample screenshot:

1 answer
  •  猫巷女王i
    2021-02-11 10:54

    I've figured it out. The box editor expects a single letter or number per box rather than a full word, so I used Unicode characters to represent my icons. The steps are as follows:

    1. Crop all the target icons that you want Tesseract to detect and save them in a single image file, named (in my case) own.std.exp0.png.
    2. Create the box file with the command 'tesseract own.std.exp0.png own.std.exp0 makebox'.
    3. Open jTessBoxEditor and enter a Unicode character in the Char column for each box. The list of supported characters can be found in the Character Map program (https://sites.psu.edu/symbolcodes/windows/charmap/). For example, I used U+2665 for the heart symbol. Note that some Unicode characters are not supported and show up as a blank square, so keep trying until you find one that works. My final edited box file looks like this.
    4. Create the final training file, own.traineddata (this can be done as shown here https://medium.com/apegroup-texts/training-tesseract-for-labels-receipts-and-such-690f452e8f79 or by training with jTessBoxEditor).
    5. Copy own.traineddata to the Tesseract/tessdata directory and run Tesseract with lang='own+eng'. I used pytesseract and the output is as below:
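
As a rough illustration of step 5 (a sketch, not the answerer's actual code or output), the stand-in characters that Tesseract returns can be translated back into icon names after the pytesseract call. Only U+2665 for the heart comes from the steps above. The other code points, the `ICON_LABELS` table, and the helper names `decode_icons` and `read_icons` are assumptions made up for this example.

```python
# Hypothetical mapping from Unicode stand-in characters to icon names.
# Only U+2665 (heart) is taken from the answer; the rest are assumed examples.
ICON_LABELS = {
    "\u2665": "like",     # BLACK HEART SUIT, used in the answer
    "\u270e": "comment",  # LOWER RIGHT PENCIL, assumed
    "\u2709": "share",    # ENVELOPE, assumed
    "\u2605": "save",     # BLACK STAR, assumed
}


def decode_icons(ocr_text):
    """Return the icon name for each known stand-in character, in reading order."""
    return [ICON_LABELS[ch] for ch in ocr_text if ch in ICON_LABELS]


def read_icons(image_path, lang="own+eng"):
    """Run Tesseract on an image and translate stand-in characters to icon names.

    Assumes own.traineddata has already been copied into the tessdata directory.
    """
    # Imported lazily so decode_icons works even without Tesseract installed.
    from PIL import Image
    import pytesseract

    text = pytesseract.image_to_string(Image.open(image_path), lang=lang)
    return decode_icons(text)
```

Calling `read_icons("screenshot.png")` (a placeholder file name) would then return a list of icon names for whichever stand-in characters Tesseract recognised. Ordinary text recognised through the `eng` model is ignored by `decode_icons`.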
