Convert a scanned PDF to text in Python


再見小時候 2021-02-02 01:40

I have a scanned PDF file and I am trying to extract the text from it. I tried to run OCR on it with pypdfocr, but I get this error:

"could not found ghostscript in

5 answers

清歌不尽 (original poster)

2021-02-02 02:17

Convert each PDF to page images, run OCR on the images with pytesseract, and export every page to its own text file.

Install these packages:

conda install -c conda-forge pytesseract

conda install -c conda-forge tesseract

pip install pdf2image
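
Both Python packages are thin wrappers around external programs: pytesseract calls the tesseract executable, and pdf2image calls Poppler's pdftoppm, which pip does not install for you. If either one is missing you get a "not found" error much like the Ghostscript one in the question. A minimal pre-flight check might look like the sketch below. The helper name missing_tools is made up for illustration, and the executable names assume default installs that are on your PATH.

```python
import shutil


def missing_tools(names=("tesseract", "pdftoppm")):
    """Return the executables from `names` that cannot be found on PATH."""
    # shutil.which returns None when the executable is not on PATH.
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All external tools found.")
```

If pdftoppm shows up as missing, Poppler has to be installed separately (conda-forge has a poppler package, for example).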

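import glob

import pytesseract
from pdf2image import convert_from_path

# Collect every PDF in the folder (replace yourPath with your own directory).
pdfs = glob.glob(r"yourPath\*.pdf")

for pdf_path in pdfs:
    # Render each page of the PDF to an image at 500 DPI.
    pages = convert_from_path(pdf_path, 500)

    for page_num, image in enumerate(pages):
        # Run OCR on the page image using the English language data.
        text = pytesseract.image_to_string(image, lang='eng')

        # Write one text file per page next to the source PDF.
        with open(f'{pdf_path[:-4]}_page{page_num}.txt', 'w', encoding='utf-8') as the_file:
            the_file.write(text)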
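
If you want to try the file-writing part without Tesseract installed, the inner loop above can be wrapped in a function that takes the OCR step as a parameter. This is only a sketch, not part of either library: save_page_texts and its ocr parameter are hypothetical names, and pages are numbered from 1 here instead of 0 as in the loop above.

```python
from pathlib import Path


def save_page_texts(pdf_path, pages, ocr):
    """Run `ocr` on each page and write one UTF-8 text file per page.

    Files are named <pdf stem>_page<N>.txt and placed next to the PDF.
    Returns the list of paths that were written.
    """
    pdf_path = Path(pdf_path)
    written = []
    for page_num, page in enumerate(pages, start=1):
        out_path = pdf_path.with_name(f"{pdf_path.stem}_page{page_num}.txt")
        out_path.write_text(ocr(page), encoding="utf-8")
        written.append(out_path)
    return written
```

With the real libraries, the call would presumably be save_page_texts(pdf_path, convert_from_path(pdf_path, 500), pytesseract.image_to_string). Passing the OCR function in also makes it easy to swap in a different engine later.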
