Convert scanned pdf to text python

再見小時候 2021-02-02 01:40

I have a scanned PDF file and I am trying to extract the text from it. I tried to use pypdfocr to run OCR on it, but I get this error:

"could not found ghostscript in

5 answers
  • 2021-02-02 02:17

    This converts each PDF to page images, runs OCR on them with pytesseract, and writes every page to its own text file.

    Install these first (pdf2image also needs Poppler to be installed):

    conda install -c conda-forge pytesseract

    conda install -c conda-forge tesseract

    pip install pdf2image

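    import glob

    import pytesseract
    from pdf2image import convert_from_path

    # Collect every PDF in the folder (replace yourPath with your own directory).
    pdfs = glob.glob(r"yourPath\*.pdf")

    for pdf_path in pdfs:
        # Render each page of the PDF to a PIL image at 500 DPI.
        pages = convert_from_path(pdf_path, 500)

        for pageNum, imgBlob in enumerate(pages):
            # OCR the page image with the English language model.
            text = pytesseract.image_to_string(imgBlob, lang='eng')

            # Write one text file per page, next to the source PDF.
            with open(f'{pdf_path[:-4]}_page{pageNum}.txt', 'w', encoding='utf-8') as the_file:
                the_file.write(text)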
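
The `pdf_path[:-4]` slice in the loop assumes the path ends in a four-character suffix such as `.pdf`. A minimal sketch of building the per-page file name with `pathlib` instead; `page_text_path` is a hypothetical helper name, and the 1-based, zero-padded page number is an assumption made here so that the files sort in page order.

```python
from pathlib import Path


def page_text_path(pdf_path, page_index, digits=3):
    """Return the .txt path for one page of a PDF (hypothetical helper).

    page_index is the 0-based index from enumerate(); the file name uses a
    1-based, zero-padded page number (an assumption, not part of the answer).
    """
    pdf = Path(pdf_path)
    page_number = str(page_index + 1).zfill(digits)
    # with_name swaps only the last path component, so the folder is preserved.
    return pdf.with_name(f"{pdf.stem}_page{page_number}.txt")


print(page_text_path("scans/report.pdf", 0))
```

Inside the loop, `open(page_text_path(pdf_path, pageNum), 'w')` could then replace the f-string.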
    
  • 2021-02-02 02:27

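    PyPDF2 is a Python library built as a PDF toolkit. It reads text that is already embedded in a PDF and does not perform OCR, so a purely scanned page will usually come back empty. It is capable of: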

    Extracting document information (title, author, …)
    Splitting documents page by page
    Merging documents page by page
    Cropping pages
    Merging multiple pages into a single page
    Encrypting and decrypting PDF files
    and more!
    

    To install PyPDF2, run the following command from the command line:

    pip install PyPDF2
    

    CODE:

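    import PyPDF2

    # PyPDF2 3.x API: the older PdfFileReader, numPages, getPage and
    # extractText names were removed in version 3.0.
    with open('myPdf.pdf', 'rb') as pdfFileObj:
        pdfReader = PyPDF2.PdfReader(pdfFileObj)

        # Number of pages in the document.
        print(len(pdfReader.pages))

        # Extract the embedded text of the first page.
        pageObj = pdfReader.pages[0]
        print(pageObj.extract_text())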
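
Because PyPDF2 only returns text that is already stored in the file, one way to use it with scanned documents is as a pre-check: try it first and fall back to OCR only for pages that come back empty. A rough sketch, assuming a PyPDF2 version that provides `PdfReader` and `extract_text`; `needs_ocr`, `pages_needing_ocr` and the 20-character threshold are hypothetical choices, not part of the library.

```python
def needs_ocr(page_text, min_chars=20):
    """Guess whether a page has no usable text layer (hypothetical heuristic)."""
    # Count only non-whitespace characters; scanned pages often yield "" or stray spaces.
    visible = "".join((page_text or "").split())
    return len(visible) < min_chars


def pages_needing_ocr(pdf_path, min_chars=20):
    """Return the 0-based indexes of pages whose embedded text looks empty."""
    # Imported lazily so needs_ocr() can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    return [
        index
        for index, page in enumerate(reader.pages)
        if needs_ocr(page.extract_text(), min_chars)
    ]


print(needs_ocr(""))
print(needs_ocr("Invoice number 2021-0042, total due 118.50 EUR"))
```

The threshold is deliberately crude; a page with only a short header would also be flagged for OCR.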
    
  • 2021-02-02 02:27

    You can use OpenCV for Python; there are many examples online of text detection with it.
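
OpenCV is often used to clean up a scanned page before it is handed to an OCR engine such as Tesseract. A small sketch of that idea, assuming `opencv-python`, `numpy` and `pytesseract` are installed; `binarize_for_ocr` and `ocr_page` are hypothetical names, and Otsu thresholding is just one possible preprocessing step, not something this answer prescribes.

```python
def binarize_for_ocr(pil_image):
    """Convert a PIL page image to a black-and-white array (hypothetical helper)."""
    # Imported lazily; requires the opencv-python and numpy packages.
    import cv2
    import numpy as np

    rgb = np.array(pil_image.convert("RGB"))
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)
    # Otsu's method picks the threshold automatically from the histogram.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary


def ocr_page(pil_image, lang="eng"):
    """Run Tesseract on the binarized page and return the recognized text."""
    import pytesseract

    return pytesseract.image_to_string(binarize_for_ocr(pil_image), lang=lang)
```

`ocr_page` expects a PIL image, for example one of the pages returned by pdf2image's `convert_from_path`.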

  • 2021-02-02 02:30

    Take a look at my code; it worked for me.

    import io

    import pytesseract
    from PIL import Image
    from wand.image import Image as wi


    def Get_text_from_image(pdf_path):
        # Render the PDF at 300 DPI and convert its pages to JPEG.
        pdf = wi(filename=pdf_path, resolution=300)
        pdfImg = pdf.convert('jpeg')
        imgBlobs = []
        extracted_text = []

        # Serialize every page to an in-memory JPEG blob.
        for img in pdfImg.sequence:
            page = wi(image=img)
            imgBlobs.append(page.make_blob('jpeg'))

        # OCR each blob and collect the text page by page.
        for imgBlob in imgBlobs:
            im = Image.open(io.BytesIO(imgBlob))
            text = pytesseract.image_to_string(im, lang='eng')
            extracted_text.append(text)

        return extracted_text
    

    I fixed the ImageMagick policy error on my machine by editing /etc/ImageMagick-6/policy.xml and changing the rights on the PDF line to "read|write".

    Open a terminal, change to the directory and open the file (you may need sudo to save it):

    cd /etc/ImageMagick-6
    nano policy.xml

    Change this line:

    <policy domain="coder" rights="read" pattern="PDF" />

    to:

    <policy domain="coder" rights="read|write" pattern="PDF" />

    Then save the file and exit.
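
Before editing the file, it can help to see which rights the policy currently grants for PDF. A minimal sketch using only the standard library; `pdf_policy_rights` is a hypothetical helper, and the path in the usage note is the one from this answer, which may differ on other systems.

```python
import xml.etree.ElementTree as ET


def pdf_policy_rights(policy_xml):
    """Return the rights strings of every coder policy whose pattern mentions PDF."""
    root = ET.fromstring(policy_xml)
    return [
        policy.get("rights", "")
        for policy in root.iter("policy")
        if policy.get("domain") == "coder" and "PDF" in policy.get("pattern", "")
    ]


sample = """<policymap>
  <policy domain="resource" name="memory" value="256MiB"/>
  <policy domain="coder" rights="read" pattern="PDF" />
</policymap>"""
print(pdf_policy_rights(sample))  # prints ['read']
```

To check the real file, its contents can be passed in as bytes, for example `pdf_policy_rights(open('/etc/ImageMagick-6/policy.xml', 'rb').read())`.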
    

    When I was extracting text from the PDF images I ran into some issues; these links helped:

    https://stackoverflow.com/questions/52699608/wand-policy-error-error-constitute-c-readimage-412
    https://stackoverflow.com/questions/52861946/imagemagick-not-authorized-to-convert-pdf-to-an-image

    For increasing the memory limit, see these links:

    https://github.com/phw/peek/issues/112
    https://github.com/ImageMagick/ImageMagick/issues/396
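
Another way to keep memory use down, besides raising ImageMagick's limits, is to render only a few pages at a time rather than the whole document. A sketch of that approach using pdf2image instead of Wand (a swapped-in library, chosen here because its `convert_from_path` accepts `first_page` and `last_page`); `page_batches`, `ocr_pdf_in_batches`, the batch size and the 300 DPI default are assumptions for illustration.

```python
def page_batches(total_pages, batch_size):
    """Yield (first_page, last_page) pairs, 1-based and inclusive."""
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def ocr_pdf_in_batches(pdf_path, batch_size=5, dpi=300, lang="eng"):
    """OCR a PDF a few pages at a time and return one string per page."""
    # Imported lazily; requires pdf2image (plus Poppler) and pytesseract.
    import pytesseract
    from pdf2image import convert_from_path, pdfinfo_from_path

    total_pages = pdfinfo_from_path(pdf_path)["Pages"]
    texts = []
    for first, last in page_batches(total_pages, batch_size):
        # Only the pages of this batch are held in memory as images.
        images = convert_from_path(pdf_path, dpi=dpi, first_page=first, last_page=last)
        texts.extend(pytesseract.image_to_string(image, lang=lang) for image in images)
    return texts


print(list(page_batches(12, 5)))  # prints [(1, 5), (6, 10), (11, 12)]
```

A smaller batch size lowers peak memory at the cost of starting the renderer more often.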
    
  • 2021-02-02 02:32

    Take a look at this library: https://pypi.python.org/pypi/pypdfocr. Keep in mind that a PDF file can also contain images. You may be able to analyse the page content streams. Some scanners break a single scanned page into several images, so you won't get the text with Ghostscript alone.
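
Since the error in the question says Ghostscript could not be found, a quick first check is whether a Ghostscript executable is on the PATH at all. A small sketch with the standard library; `find_ghostscript` is a hypothetical helper, and the candidate names (`gs` on Linux and macOS, `gswin64c` or `gswin32c` on Windows) are the usual executable names, which a particular installation may not follow.

```python
import shutil


def find_ghostscript(candidates=("gs", "gswin64c", "gswin32c")):
    """Return the full path of the first Ghostscript executable on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_ghostscript()
print(location or "Ghostscript was not found on PATH")
```

If this reports nothing, installing Ghostscript or adding its folder to PATH is the likely next step before retrying pypdfocr.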
