Convert scanned PDF to text with Python

再見小時候 2021-02-02 01:40

I have a scanned PDF file and I am trying to extract text from it. I tried to use pypdfocr to run OCR on it, but I get an error:

"could not found ghostscript in

5 answers
  •  清歌不尽
    2021-02-02 02:30

    Take a look at my code; it worked for me.

    import io

    import pytesseract
    from PIL import Image
    from wand.image import Image as wi


    def Get_text_from_image(pdf_path):
        """Return a list with the OCR text of each page of a scanned PDF."""
        imgBlobs = []
        extracted_text = []

        # Rasterize the PDF at 300 DPI and convert the pages to JPEG.
        with wi(filename=pdf_path, resolution=300) as pdf:
            pdfImg = pdf.convert('jpeg')
            for img in pdfImg.sequence:
                page = wi(image=img)
                imgBlobs.append(page.make_blob('jpeg'))

        # Run Tesseract OCR on each page image.
        for imgBlob in imgBlobs:
            im = Image.open(io.BytesIO(imgBlob))
            text = pytesseract.image_to_string(im, lang='eng')
            extracted_text.append(text)

        return extracted_text
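
The code above depends on external programs: ImageMagick hands PDF rendering to Ghostscript, and pytesseract calls the tesseract binary. A minimal sketch for checking that both are on PATH before running the OCR follows. The helper names and the list of Ghostscript executable names are assumptions for illustration, not part of any library.

```python
import shutil

# Assumed executable names: "gs" is the usual Ghostscript binary on
# Linux/macOS, "gswin64c"/"gswin32c" are the console builds on Windows.
GHOSTSCRIPT_NAMES = ("gs", "gswin64c", "gswin32c")


def find_executable(names):
    """Return the full path of the first name found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


def check_ocr_prerequisites():
    """Map each external tool to its path on PATH (None if it is missing)."""
    return {
        "ghostscript": find_executable(GHOSTSCRIPT_NAMES),
        "tesseract": find_executable(("tesseract",)),
    }


if __name__ == "__main__":
    for tool, path in check_ocr_prerequisites().items():
        print(f"{tool}: {path or 'NOT FOUND on PATH'}")
```

If Ghostscript is reported as missing, that would be consistent with the "could not found ghostscript" error in the question; installing it (or adding it to PATH) is the likely first step.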
    

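    I fixed the ImageMagick policy error by editing /etc/ImageMagick-6/policy.xml and changing the rights on the PDF line to "read|write".

    Open a terminal and go to the configuration directory:

    cd /etc/ImageMagick-6
    nano policy.xml

    In the PDF line, change the rights to "read|write", then save the file and exit the editor.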
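
To check whether the policy file still blocks PDF, the file can be inspected from Python. This is a minimal sketch using only the standard library; the function names and the sample XML are hypothetical, and the sample assumes the usual layout of one `policy` element per rule with `domain`, `rights` and `pattern` attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; a real policy.xml holds many more entries.
SAMPLE_POLICY = """<policymap>
  <policy domain="coder" rights="none" pattern="PDF" />
  <policy domain="coder" rights="read|write" pattern="PNG" />
</policymap>"""


def coder_rights(policy_xml, pattern="PDF"):
    """Return the rights string of every coder policy covering `pattern`."""
    root = ET.fromstring(policy_xml)
    found = []
    for policy in root.iter("policy"):
        if policy.get("domain") != "coder":
            continue
        # A pattern may be a single name or a brace list such as {PS,PDF}.
        patterns = policy.get("pattern", "").strip("{}").split(",")
        if pattern in patterns:
            found.append(policy.get("rights", ""))
    return found


def pdf_is_allowed(policy_xml):
    """True when no PDF coder policy withholds the read right."""
    return all("read" in rights.split("|") for rights in coder_rights(policy_xml))
```

For the real file, something like `pdf_is_allowed(open("/etc/ImageMagick-6/policy.xml", "rb").read())` should work, assuming the file is readable by the current user.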
    

    When I was extracting text from the PDF images I ran into some issues; please go through the links below:

    https://stackoverflow.com/questions/52699608/wand-policy-error-error-constitute-c-readimage-412

    https://stackoverflow.com/questions/52861946/imagemagick-not-authorized-to-convert-pdf-to-an-image
    
    To increase the memory limit, please go through the links below:
    https://github.com/phw/peek/issues/112
    https://github.com/ImageMagick/ImageMagick/issues/396
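
Besides editing policy.xml, ImageMagick also reads resource limits from environment variables such as MAGICK_MEMORY_LIMIT. The sketch below sets them from Python. The helper name and the specific values are assumptions for illustration, and a stricter limit in policy.xml may still take precedence.

```python
import os

# Assumed values for illustration; tune them to the machine and the PDFs.
DEFAULT_LIMITS = {
    "MAGICK_MEMORY_LIMIT": "2GiB",
    "MAGICK_MAP_LIMIT": "4GiB",
    "MAGICK_DISK_LIMIT": "8GiB",
}


def apply_magick_limits(limits=DEFAULT_LIMITS, env=os.environ):
    """Set ImageMagick limit variables that are not already set.

    Returns a dict of the variables this call actually added.
    """
    applied = {}
    for name, value in limits.items():
        if name not in env:
            env[name] = value
            applied[name] = value
    return applied
```

Call `apply_magick_limits()` before `from wand.image import Image`, so that the variables are already in place when ImageMagick starts up. Values already set in the environment are left untouched.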
    
