I have a scanned PDF file and I am trying to extract text from it. I tried using pypdfocr to run OCR on it, but I get this error:
"could not found ghostscript in
Convert the PDFs to images, use pytesseract to OCR each page, and export each page to its own text file.
Install these (pdf2image also needs Poppler to be installed on your system):
conda install -c conda-forge pytesseract
conda install -c conda-forge tesseract
pip install pdf2image
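Both Python packages are wrappers around external programs, so an error like the Ghostscript one in the question usually means an executable is not on your PATH. Here is a minimal sketch for checking that before running the OCR. The helper name find_missing_tools is made up for illustration; "tesseract" and "pdftoppm" are the executables that pytesseract and pdf2image call by default, and your setup may use different names or locations.

```python
import shutil


def find_missing_tools(tool_names):
    """Return the names in tool_names that cannot be located on PATH."""
    # shutil.which returns None when no matching executable is found
    return [name for name in tool_names if shutil.which(name) is None]


# Assumed executable names: "tesseract" (OCR engine) and "pdftoppm" (Poppler)
missing = find_missing_tools(["tesseract", "pdftoppm"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external tools were found.")
```

If something is reported as missing, install it or add its folder to PATH before going further.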
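import glob

import pytesseract
from pdf2image import convert_from_path

# Collect every PDF in the folder (replace yourPath with your own directory)
pdfs = glob.glob(r"yourPath\*.pdf")

for pdf_path in pdfs:
    # Render each page of the PDF to an image at 500 DPI
    pages = convert_from_path(pdf_path, 500)
    for pageNum, imgBlob in enumerate(pages):
        # OCR the page image using the English language data
        text = pytesseract.image_to_string(imgBlob, lang='eng')
        # Strip ".pdf" from the path and write one text file per page
        with open(f'{pdf_path[:-4]}_page{pageNum}.txt', 'w', encoding='utf-8') as the_file:
            the_file.write(text)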
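The output file name in the loop above is built by slicing the last four characters off the path. If you prefer, the naming can be moved into a small helper based on pathlib. This is only a sketch of one alternative: the name page_txt_path is hypothetical, and it deliberately changes the scheme to 1-based, zero-padded page numbers so that the files sort in page order.

```python
from pathlib import Path


def page_txt_path(pdf_path, page_index):
    """Build '<pdf stem>_page<NNN>.txt' in the same folder as the PDF.

    page_index is the 0-based index from enumerate(); the file name
    uses a 1-based, zero-padded page number instead.
    """
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_page{page_index + 1:03d}.txt")


# Example with a made-up path; nothing is read from or written to disk here
print(page_txt_path("scans/report.pdf", 0))
```

Inside the loop you would then open page_txt_path(pdf_path, pageNum) instead of the f-string.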