Convert a scanned PDF to text in Python

再見小時候 2021-02-02 01:40

I have a scanned PDF file and I am trying to extract text from it. I tried to use pypdfocr to run OCR on it, but I get an error:

"could not found ghostscript in

5 answers
  •  清歌不尽
    2021-02-02 02:17

    This converts the PDFs to images, uses pytesseract to do the OCR, and exports each page of each PDF to its own text file.

    Install these:

    conda install -c conda-forge pytesseract

    conda install -c conda-forge tesseract

    pip install pdf2image

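    import glob

    import pytesseract
    from pdf2image import convert_from_path

    # Collect every PDF in the folder (replace yourPath with your own directory).
    pdfs = glob.glob(r"yourPath\*.pdf")

    for pdf_path in pdfs:
        # Render each page to a PIL image at 500 DPI (pdf2image relies on poppler).
        pages = convert_from_path(pdf_path, 500)

        for page_num, image in enumerate(pages):
            # Run Tesseract OCR on the page image with the English model.
            text = pytesseract.image_to_string(image, lang='eng')

            # One text file per page, next to the PDF; pages are numbered from 0.
            # UTF-8 avoids encoding errors when the OCR output has non-ASCII characters.
            with open(f'{pdf_path[:-4]}_page{page_num}.txt', 'w', encoding='utf-8') as the_file:
                the_file.write(text)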
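
The Python packages in this answer are wrappers around external programs: pytesseract runs the `tesseract` executable, and pdf2image runs poppler's `pdftoppm`. Poppler is not covered by the install commands listed; with conda, `conda install -c conda-forge poppler` is one way to get it. If either program is missing from PATH, the script will likely stop with a "not found" style error, much like the ghostscript message in the question. Below is a minimal sketch of a pre-flight check, assuming both tools are expected on PATH. The helper name `missing_tools` is hypothetical and not part of either library.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the executables from `names` that cannot be found on PATH.

    `which` is injectable only so the lookup can be replaced in tests;
    by default it is the standard library's shutil.which.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    # Assumption: these are the two external programs the OCR script needs.
    missing = missing_tools(["tesseract", "pdftoppm"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("tesseract and pdftoppm are both available")
```

If a tool is installed but not on PATH, both libraries accept an explicit location: set `pytesseract.pytesseract.tesseract_cmd` to the full path of the tesseract executable, and pass `poppler_path=...` to `convert_from_path`.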
    
