Cache error while doing OCR on a directory of PDFs in Python

Submitted by 蹲街弑〆低调 on 2021-02-08 10:21:16

Question


I am trying to OCR an entire directory of PDF files using pytesseract and ImageMagick (through the wand bindings). The problem is that ImageMagick fills up my Temp folder, and eventually I get a cache error:

"CacheError: unable to extend cache 'C:/Users/Azu/AppData/Local/Temp/magick-18244WfgPyAToCsau11': No space left on device @ error/cache.c/OpenPixelCache/3883"

I have also written code to delete the Temp folder contents once a file has been OCR'd, but I am still facing the same issue.

Here's the code so far:

import io
import os
import glob
from PIL import Image
import pytesseract
from wand.image import Image as wi


files = glob.glob(r"D:\files\**")
tempdir = r"C:\Users\Azu\AppData\Local\Temp"
filesall = os.listdir(tempdir) 
for file in files:
    name = os.path.basename(file).split('.')[0]
    #print(file)
    pdf = wi(filename = file, resolution = 300)

    pdfImg = pdf.convert('jpeg')

    imgBlobs = []

    for img in pdfImg.sequence:
        page = wi(image = img)
        imgBlobs.append(page.make_blob('jpeg'))

    extracted_texts = []

    for imgBlob in imgBlobs:
        im = Image.open(io.BytesIO(imgBlob))
        text = pytesseract.image_to_string(im, lang = 'eng')
        extracted_texts.append(text)


    with open("D:\\extracted_text\\"+ name + ".txt", 'w') as f:
        f.write(str(extracted_texts))

for ifile in filesall:
    if "magick" in ifile:
        os.remove(os.path.join(tempdir,ifile))
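
Two things in the code above look like plausible contributors to the error. The wand images are never closed, and the cleanup loop runs only once, after every PDF has been processed, over a directory listing taken before any temp file existed. The following is a minimal sketch of one way to restructure it, not a confirmed fix. It assumes the unreleased wand objects are what keeps the `magick-*` cache files alive, that only `*.pdf` files should be processed, and that ImageMagick writes to the directory returned by `tempfile.gettempdir()`. The function names `remove_magick_temp_files`, `ocr_pdf` and `ocr_directory` are hypothetical, and the page texts are joined with newlines instead of being written as a list `repr`.

```python
import glob
import io
import os
import tempfile


def remove_magick_temp_files(tempdir=None):
    """Delete leftover 'magick-*' files in tempdir; return how many were removed."""
    # Assumption: ImageMagick uses the default temp directory unless told otherwise.
    tempdir = tempdir or tempfile.gettempdir()
    removed = 0
    for path in glob.glob(os.path.join(tempdir, "magick-*")):
        try:
            os.remove(path)
            removed += 1
        except OSError:
            # Still locked by a running process, or not a regular file: skip it.
            pass
    return removed


def ocr_pdf(pdf_path, resolution=300, lang="eng"):
    """Return a list with the OCR text of each page of one PDF."""
    # Imported here so the cleanup helper above works without these packages.
    from PIL import Image
    import pytesseract
    from wand.image import Image as WandImage

    texts = []
    # 'with' closes each wand image as soon as it is no longer needed,
    # which lets ImageMagick release the pixel cache backing it.
    with WandImage(filename=pdf_path, resolution=resolution) as pdf:
        for frame in pdf.sequence:
            with WandImage(image=frame) as page:
                blob = page.make_blob("jpeg")
            with Image.open(io.BytesIO(blob)) as im:
                texts.append(pytesseract.image_to_string(im, lang=lang))
    return texts


def ocr_directory(src_dir, out_dir):
    """OCR every PDF in src_dir and write one .txt file per PDF into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for pdf_path in glob.glob(os.path.join(src_dir, "*.pdf")):
        name = os.path.splitext(os.path.basename(pdf_path))[0]
        texts = ocr_pdf(pdf_path)
        out_path = os.path.join(out_dir, name + ".txt")
        with open(out_path, "w", encoding="utf-8") as f:
            f.write("\n".join(texts))
        # Clean up after each PDF, not once at the very end.
        remove_magick_temp_files()
```

With the paths from the question, this would be called as `ocr_directory(r"D:\files", r"D:\extracted_text")`. Processing one page at a time also avoids holding every JPEG blob of a document in memory at once.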

Source: https://stackoverflow.com/questions/56454582/cache-error-while-doing-ocr-on-a-directory-of-pdfs-in-python
