Tesseract and tiff format - spp not in set {1,3}

佛祖请我去吃肉 2021-02-06 21:32

While trying to run this command:

tesseract bond111.tif bond111 batch.nochop makebox

I get the following error:

Error in pixReadFromT
4 answers
  • 2021-02-06 21:51

    Thanks for your post, ZakW; you pointed me in the right direction. However, I also needed to set '-depth 8'. The quality was still not good enough for OCR, whatever I tried.

    What worked for me is this solution:

    ghostscript -o document.tiff -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw document.pdf
    tesseract document.tiff document -l deu
    vim document.txt
    

    This way I got perfect text, including umlauts, in German.
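The -g value in the Ghostscript command above appears to be the page size in pixels at the chosen resolution: 6120x7920 is exactly 8.5 x 11 inches at 720 dpi. As a minimal sketch, assuming a US Letter page (an inference from those numbers, not something the answer states), the geometry for another resolution could be computed as follows. `letter_geometry` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the WIDTHxHEIGHT pixel geometry of a
# US Letter page (8.5 x 11 in, an assumption) at the given DPI.
# Integer arithmetic: 8.5 * dpi is computed as dpi * 17 / 2, which
# truncates for odd DPI values.
letter_geometry() {
    dpi=$1
    echo "$((dpi * 17 / 2))x$((dpi * 11))"
}
```

For example, `letter_geometry 720` prints `6120x7920`, matching the -r720x720 -g6120x7920 pair used above.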

  • 2021-02-06 21:56

    You can use the 'tiffinfo' command provided by libtiff-tools to check the TIFF format of your source image. Many TIFF variants exist, with different values for bits per pixel (bpp) and samples per pixel (spp).

    Error in pixReadFromTiffStream: spp not in set {1,3,4}

    An 'spp' value of 2 (typically grayscale plus an alpha channel) falls outside that set, so Leptonica rejects the file.

    I solved the problem by saving directly to TIFF format from Gimp, instead of converting from .png to .tif using ImageMagick's 'convert'.

    See also: TIFF format
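The tiffinfo check described in this answer can be scripted. The following is a minimal sketch, assuming tiffinfo prints a line of the form "Samples/Pixel: N" and taking the accepted set {1,3,4} from the error message quoted above (other Leptonica versions may report a different set). Both function names are hypothetical.

```shell
#!/bin/sh
# Read `tiffinfo` output on stdin and print the Samples/Pixel value.
# Assumes a line such as "  Samples/Pixel: 2" is present.
spp_from_tiffinfo() {
    awk '/Samples\/Pixel:/ { print $NF; exit }'
}

# Succeed if the spp value is in the set named by the error message.
spp_is_accepted() {
    case "$1" in
        1|3|4) return 0 ;;
        *)     return 1 ;;
    esac
}

# Usage against a real file (requires tiffinfo to be installed):
#   spp=$(tiffinfo bond111.tif | spp_from_tiffinfo)
#   spp_is_accepted "$spp" || echo "spp=$spp: re-save the image first"
```

Splitting the parsing from the decision keeps the accepted set in one place, so it is easy to adjust if your error message lists different values.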

  • 2021-02-06 22:18

    It probably means your TIFF image has an alpha channel, which the underlying Leptonica library used by Tesseract doesn't support. If you're using ImageMagick, be aware that operations such as -draw can cause an alpha channel to be added. If you're using convert in your workflow and want to remove the channel again immediately, flatten the image before writing by adding -background white -flatten +matte before the output filename, e.g.:

    convert input.tiff -fill white -draw 'rectangle 10,10 20,20' -background white -flatten +matte output.tiff
    

    Tesseract (well, Leptonica) accepts PNGs these days and is less picky about them, so it might be easier to migrate your workflow to PNG anyway.

    Sources: magick-users mailing list posting; tesseract-ocr mailing list posting

  • 2021-02-06 22:18

    Changing the conversion command to the following helped me:

    convert -density 300 input.pdf -depth 8 -background white -alpha Off output.tiff
    

    Note that the other answers did not work for me since they use the deprecated +matte flag instead of -alpha Off.
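Putting the last two answers together, one possible wrapper flattens the image onto white, turns the alpha channel off with -alpha off in place of +matte, and then runs Tesseract on the result. This is only a sketch: `run` and `ocr_without_alpha` are hypothetical names, the ".flat.tiff" suffix is an arbitrary choice, and the flag combination has not been checked against every ImageMagick version. Setting DRY_RUN=1 prints the commands instead of executing them.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: write an alpha-free copy of a TIFF, then OCR it.
#   $1 = input image, $2 = output base name for Tesseract
ocr_without_alpha() {
    src=$1
    base=$2
    # Flatten onto white and disable alpha, as the answers above suggest.
    run convert "$src" -background white -flatten -alpha off "$base.flat.tiff" &&
    run tesseract "$base.flat.tiff" "$base"
}

# Usage (requires ImageMagick and Tesseract unless DRY_RUN=1):
#   ocr_without_alpha input.tiff output
```

The dry-run mode makes it easy to inspect the exact commands before touching any files.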
