Tesseract OCR on AWS Lambda via virtualenv

终归单人心 2020-11-30 22:36

I have spent all week attempting this, so this is a bit of a hail mary.

I am attempting to package up Tesseract OCR into AWS Lambda running on Python (I am also usin

4 Answers
  • 2020-11-30 23:09

    Adaptations for Tesseract 4:

    Tesseract 4 offers substantial improvements thanks to a neural network. I've tried it on some scans and the results are noticeably better. The whole package was also 25% smaller in my case. The planned release date for version 4 is the first half of 2018.

    The build steps are similar to those for Tesseract 3, with some tweaks, which is why I wanted to share them in full. I also made a GitHub repo with ready-made binary files (most of it is based on Jose's post in this thread, which was very helpful), plus a blog post on how to use it as a processing step after a Raspberry Pi 3 powered scanner step.

    To compile the Tesseract 4 binaries, run these steps on a fresh 64-bit AWS AMI instance:

    Compile leptonica

    cd ~
    sudo yum install clang -y
    sudo yum install libpng-devel libtiff-devel zlib-devel libwebp-devel libjpeg-turbo-devel -y
    wget https://github.com/DanBloomberg/leptonica/releases/download/1.75.1/leptonica-1.75.1.tar.gz
    tar -xzvf leptonica-1.75.1.tar.gz
    cd leptonica-1.75.1
    ./configure && make && sudo make install
    

    Compile autoconf-archive

    Unfortunately, for the past few weeks Tesseract has required autoconf-archive, which is not available for Amazon AMIs, so you need to compile it yourself:

    cd ~
    wget http://mirror.switch.ch/ftp/mirror/gnu/autoconf-archive/autoconf-archive-2017.09.28.tar.xz
    tar -xvf autoconf-archive-2017.09.28.tar.xz
    cd autoconf-archive-2017.09.28
    ./configure && make && sudo make install
    sudo cp m4/* /usr/share/aclocal/
    

    Compile tesseract

    cd ~
    sudo yum install git-core libtool pkgconfig -y
    git clone --depth 1  https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
    cd tesseract-ocr
    export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
    ./autogen.sh
    ./configure
    make
    sudo make install
    

    Get all needed files and zip

    cd ~
    mkdir tesseract-standalone
    cd tesseract-standalone
    cp /usr/local/bin/tesseract .
    mkdir lib
    cp /usr/local/lib/libtesseract.so.4 lib/
    cp /usr/local/lib/liblept.so.5 lib/
    cp /usr/lib64/libjpeg.so.62 lib/
    cp /usr/lib64/libwebp.so.4 lib/
    cp /usr/lib64/libstdc++.so.6 lib/
    mkdir tessdata
    cd tessdata
    wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/osd.traineddata
    wget https://github.com/tesseract-ocr/tessdata_fast/raw/master/eng.traineddata
    # additionally any other language you want to use, e.g. `deu` for Deutsch
    mkdir configs
    cp /usr/local/share/tessdata/configs/pdf configs/
    cp /usr/local/share/tessdata/pdf.ttf .
    cd ..
    zip -r ~/tesseract-standalone.zip *
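
    The steps above stop at the zip file, so here is a minimal sketch of how the unpacked bundle might be called from Python. The function name build_tesseract_command and the /var/task path in the docstring are illustrative assumptions, not part of the build above; --tessdata-dir and -l are standard tesseract command-line options.

```python
import os


def build_tesseract_command(bundle_dir, image_path, output_base, lang="eng"):
    """Return (argv, env) for running the bundled tesseract binary.

    bundle_dir is wherever tesseract-standalone.zip was unpacked
    (an assumption; for a Lambda deployment package this is /var/task).
    """
    argv = [
        os.path.join(bundle_dir, "tesseract"),
        image_path,
        output_base,  # tesseract appends ".txt" to this base name
        "--tessdata-dir", os.path.join(bundle_dir, "tessdata"),
        "-l", lang,
    ]
    # Let the dynamic linker find the shared objects copied into lib/.
    env = dict(os.environ, LD_LIBRARY_PATH=os.path.join(bundle_dir, "lib"))
    return argv, env
```

    Passing the result to subprocess.run(argv, env=env, check=True) should then run OCR and write the recognized text to output_base + ".txt".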
    
  • 2020-11-30 23:10

    Check this Medium article on how to set up Tesseract 4.0.0 in Lambda using Docker. It also shows how to convert Python packages into layers.

  • 2020-11-30 23:14

    The reason it's not working is that these Python packages are only wrappers around tesseract. You have to compile tesseract on an Amazon Linux instance and copy the binaries and libraries into the zip file of the Lambda function.

    1) Start an EC2 instance with 64-bit Amazon Linux;

    2) Install dependencies:

    sudo yum install gcc gcc-c++ make
    sudo yum install autoconf automake    # aclocal ships with automake
    sudo yum install libtool
    sudo yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
    

    3) Compile and install leptonica:

    cd ~
    mkdir leptonica
    cd leptonica
    wget http://www.leptonica.com/source/leptonica-1.73.tar.gz
    tar -zxvf leptonica-1.73.tar.gz
    cd leptonica-1.73
    ./configure
    make
    sudo make install
    

    4) Compile and install tesseract

    cd ~
    mkdir tesseract
    cd tesseract
    wget https://github.com/tesseract-ocr/tesseract/archive/3.04.01.tar.gz
    tar -zxvf 3.04.01.tar.gz
    cd tesseract-3.04.01
    ./autogen.sh
    ./configure
    make
    sudo make install
    

    5) Download language traineddata to tessdata

    cd /usr/local/share/tessdata
    wget https://github.com/tesseract-ocr/tessdata/raw/3.04.00/eng.traineddata
    export TESSDATA_PREFIX=/usr/local/share/
    

    At this point you should be able to use tesseract on this EC2 instance. To use tesseract in a Lambda function, you need to copy some files from this instance into the zip file you upload to Lambda. The commands below produce a zip file with all the files you need.

    6) Zip all the stuff you need to run tesseract on lambda

    cd ~
    mkdir tesseract-lambda
    cd tesseract-lambda
    cp /usr/local/bin/tesseract .
    mkdir lib
    cd lib
    cp /usr/local/lib/libtesseract.so.3 .
    cp /usr/local/lib/liblept.so.5 .
    cp /usr/lib64/libpng12.so.0 .
    cd ..
    
    mkdir tessdata
    cd tessdata
    cp /usr/local/share/tessdata/eng.traineddata .
    cd ..
    
    # Zip from inside the directory so tesseract, lib/ and tessdata/ sit at the archive root
    zip -r ~/tesseract-lambda.zip .
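
    The Lambda function in step 7 looks for tesseract, lib/ and tessdata/ right next to main.py, so it may be worth checking the archive layout before uploading. A minimal sketch, assuming Python 3 is available on the build machine; the helper name missing_entries and the REQUIRED tuple are illustrative and should be adjusted to whatever you bundle:

```python
import zipfile

# Entries the handler relies on, relative to the archive root.
REQUIRED = (
    "tesseract",
    "lib/libtesseract.so.3",
    "lib/liblept.so.5",
    "tessdata/eng.traineddata",
)


def missing_entries(zip_path_or_file, required=REQUIRED):
    """Return the required entries that are absent from the archive root."""
    with zipfile.ZipFile(zip_path_or_file) as archive:
        names = set(archive.namelist())
    return [entry for entry in required if entry not in names]
```

    An empty list means every expected file is at the root; entries nested under a top-level folder are reported as missing.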
    

    The tesseract-lambda.zip file has everything Lambda needs to run tesseract. The last thing to do is add the Lambda function at the root of the zip file and upload it to Lambda. Here is an example that I have not tested, but it should work.

    7) Create a file named main.py, write a Lambda function like the one below, and add it to the root of tesseract-lambda.zip:

    import os
    import subprocess
    from urllib.parse import unquote_plus
    
    import boto3
    
    SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
    LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')
    
    s3 = boto3.client('s3')
    
    def lambda_handler(event, context):
    
        # Get the bucket and object key from the S3 event (keys arrive URL-encoded)
        record = event['Records'][0]['s3']
        bucket = record['bucket']['name']
        key = unquote_plus(record['object']['key'])
    
        try:
            print("Bucket: " + bucket)
            print("Key: " + key)
    
            imgfilepath = '/tmp/image.png'
            outputbase = '/tmp/result'  # tesseract appends ".txt" to this name
            resultfilepath = outputbase + '.txt'
            exportfile = key + '.txt'
    
            print("Export: " + exportfile)
    
            s3.download_file(bucket, key, imgfilepath)
    
            # Point the bundled binary at the bundled libraries and tessdata
            env = dict(os.environ, LD_LIBRARY_PATH=LIB_DIR, TESSDATA_PREFIX=SCRIPT_DIR)
            command = [os.path.join(SCRIPT_DIR, 'tesseract'), imgfilepath, outputbase]
    
            try:
                output = subprocess.check_output(command, env=env, stderr=subprocess.STDOUT)
                print(output.decode('utf8', 'replace'))
                s3.upload_file(resultfilepath, bucket, exportfile)
            except subprocess.CalledProcessError as e:
                print(e.output)
    
        except Exception as e:
            print(e)
            print('Error processing object {} from bucket {}.'.format(key, bucket))
            raise
    

    When creating the AWS Lambda function in the AWS Console, upload the zip file and set the Handler to main.lambda_handler. This tells AWS Lambda to look for the main.py file inside the zip and to call the function lambda_handler.
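
    Before wiring up the S3 trigger, the event parsing can be exercised locally. This is a minimal sketch written for Python 3, where unquote_plus lives in urllib.parse; the helper name parse_s3_event and the bucket and key in the sample event are made up for illustration.

```python
from urllib.parse import unquote_plus


def parse_s3_event(event):
    """Return (bucket, key) from the first record of an S3 notification."""
    s3_info = event["Records"][0]["s3"]
    # S3 URL-encodes object keys in notifications (spaces arrive as "+").
    return s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


# Hypothetical event with made-up bucket and key names.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "my-scans"},
                "object": {"key": "inbox/page+1%282%29.png"}}}
    ]
}
print(parse_s3_event(sample_event))  # ('my-scans', 'inbox/page 1(2).png')
```

    Skipping the decoding step would make download_file look for a key that does not exist whenever the object name contains spaces or other escaped characters.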

    IMPORTANT

    From time to time things change in AWS Lambda's environment. For example, the current image for the Lambda environment is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2 (it might not be this one when you read this answer). If tesseract starts returning segmentation faults, run "ldd tesseract" inside the Lambda function and check the output to see which libraries are needed (currently libtesseract.so.3, liblept.so.5, and libpng12.so.0).
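
    One way to run that check from inside the function is sketched below. It assumes ldd is available in the runtime, as the advice above implies; the helper names missing_libraries and check_binary are illustrative.

```python
import os
import subprocess


def missing_libraries(ldd_output):
    """Return library names that ldd reported as "not found"."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like: "libfoo.so.1 => not found"
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path, lib_dir=None):
    """Run ldd on path and return the libraries it could not resolve."""
    env = dict(os.environ)
    if lib_dir:
        # Resolve against the bundled lib/ directory as well.
        env["LD_LIBRARY_PATH"] = lib_dir
    result = subprocess.run(["ldd", path], env=env, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    return missing_libraries(result.stdout)
```

    Logging check_binary(os.path.join(SCRIPT_DIR, 'tesseract'), LIB_DIR) from the handler should list the shared objects that still need to be copied into lib/.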

    Thanks for the comment, SergioArcos.

  • 2020-11-30 23:19

    Generate zip files with shell scripts that compile Tesseract 4 for Python 3.7

    I struggled with this issue for a few days, trying to get Tesseract 4 to work in a Python 3.7 Lambda function. I finally found this article and GitHub repo, which describe how to build the zip files for tesseract, pytesseract, opencv, and pillow with shell scripts that run Docker images on EC2. The process takes less than 20 minutes with these steps and is reliably reproducible.

    Summarized Steps:

    Start an Amazon Linux EC2 instance (a t2.micro will do just fine)

    sudo yum update
    sudo yum install git-core -y
    sudo yum install docker -y
    sudo service docker start
    sudo usermod -a -G docker ec2-user #allows ec2-user to call docker
    

    After running the fifth command (usermod), you will need to log out and log back in for the change to take effect.

    git clone https://github.com/amtam0/lambda-tesseract-api.git
    cd lambda-tesseract-api/
    bash build_tesseract4.sh #takes a few minutes
    bash build_py37_pkgs.sh
    

    This generates .zip files for tesseract, pytesseract, pillow, and opencv. To use them with Lambda, you need to complete two more steps.

    1. Create Lambda layers, one for each zip file, and attach the layers to your Lambda function.
    2. Create an environment variable with key PYTHONPATH and value /opt/

    (Note: you will probably need to increase your Memory allocation and Timeout)
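
    The two steps and the note above can also be scripted with boto3 instead of the console. The sketch below is untested against a real account; the function names, the 1024 MB memory size and the 60 second timeout are assumptions for illustration, while publish_layer_version and update_function_configuration are the boto3 Lambda client calls being wrapped.

```python
def layer_request(layer_name, zip_bytes, runtime="python3.7"):
    """Keyword arguments for the Lambda client's publish_layer_version."""
    return {
        "LayerName": layer_name,
        "Content": {"ZipFile": zip_bytes},
        "CompatibleRuntimes": [runtime],
    }


def function_config_request(function_name, layer_arns,
                            memory_mb=1024, timeout_s=60):
    """Keyword arguments for update_function_configuration."""
    return {
        "FunctionName": function_name,
        "Layers": list(layer_arns),
        "Environment": {"Variables": {"PYTHONPATH": "/opt/"}},
        "MemorySize": memory_mb,  # assumed value; tune for your images
        "Timeout": timeout_s,     # assumed value, in seconds
    }


def publish_and_attach(function_name, zips):
    """zips maps a layer name to the path of a generated .zip file."""
    import boto3  # imported here so the builders above work without AWS access

    client = boto3.client("lambda")
    arns = []
    for name, path in zips.items():
        with open(path, "rb") as fh:
            response = client.publish_layer_version(
                **layer_request(name, fh.read()))
        arns.append(response["LayerVersionArn"])
    client.update_function_configuration(
        **function_config_request(function_name, arns))
```

    Note that the Environment argument replaces the function's existing variables rather than merging with them, and large archives may need the S3Bucket/S3Key form of Content instead of an inline ZipFile.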

    At this point you are all set to upload your code and start using Tesseract on AWS Lambda! Refer back to the Medium article for a test script.
