Converting PDF to images automatically

So the state I'm in released a bunch of data in PDF form, but to make matters worse, most (all?) of the PDFs appear to be letters typed in Office, printed/fax, and then sca

6 Answers

北海茫月

2020-12-08 02:59

Here's an alternative approach to turning a .pdf file into images: use an image printer. I've successfully used the function below to "print" PDFs to JPEG images with ImagePrinter Pro. However, there are many image printers out there, so pick the one you like. Some of the code may need to be altered slightly depending on the image printer you pick and the file naming and saving format it uses.

import os
import time

import win32api  # part of the pywin32 package (Windows only)


def pdf_to_jpg(pdfPath, pages):
    # "Print" a PDF to JPEG files using an image printer.
    # 'pages' is the number of pages in the PDF.
    # os.path.split also copes with a bare file name (no directory part).
    filepath, filename = os.path.split(pdfPath)
    filepath = filepath or "."

    # Print the PDF to JPEG via the shell's "printto" verb,
    # using the PDF's own folder as the working directory
    tempprinter = "ImagePrinter Pro"
    printer = '"%s"' % tempprinter
    win32api.ShellExecute(0, "printto", filename, printer, filepath, 0)

    # Expected name of the last page's JPEG
    # (the naming scheme depends on the image printer)
    if pages > 1:
        jpgName = filename.split('.')[0] + '_' + str(pages - 1) + '.jpg'
    else:
        jpgName = filename.split('.')[0] + '.jpg'
    jpgPath = os.path.join(filepath, jpgName)

    # Poll for up to waitTime seconds so the PDF finishes printing first
    fileFound = False
    waitTime = 30
    for i in range(waitTime):
        if os.path.isfile(jpgPath):
            fileFound = True
            break
        time.sleep(1)

    # Print an error if the file was never found
    if not fileFound:
        print("ERROR: " + jpgName + " wasn't found after " + str(waitTime)
              + " seconds")

    return jpgPath

The returned jpgPath is the path of the JPEG for the last page of the PDF that was printed. If you need another page, you can add some logic to modify the path to get earlier pages.
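As a rough sketch of that "logic to modify the path": the helper below rebuilds the file name for any page. It assumes the image printer numbers every page of a multi-page PDF the same way the function above numbers the last one (page k saved as name_<k-1>.jpg). That naming is an assumption and should be checked against the printer you actually use. jpg_path_for_page is a hypothetical helper name.

```python
import os


def jpg_path_for_page(pdf_path, page_number, pages):
    """Guess the JPEG path for a 1-based page_number of a PDF with 'pages' pages.

    Assumption: the image printer saves page k of a multi-page PDF as
    <name>_<k-1>.jpg and a single-page PDF as <name>.jpg, in the PDF's folder.
    """
    directory, filename = os.path.split(pdf_path)
    base = filename.split('.')[0]
    if pages > 1:
        name = "%s_%d.jpg" % (base, page_number - 1)
    else:
        name = base + ".jpg"
    return os.path.join(directory, name)
```

For the last page (page_number equal to pages) this gives the same file name that pdf_to_jpg waits for.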

面向向阳花

2020-12-08 03:00
Ghostscript is ideal for converting PDF files to images. It is reliable and has many configurable options. It's also available under either the GPL or a commercial license. You can call it from the command line or use its native API. For more information:
- Ghostscript Main Website
- Ghostscript docs on Command line usage
- Another Stack Overflow thread with some examples of invoking Ghostscript's command-line interface from Python
- Ghostscript API Documentation
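
For illustration, here is a minimal sketch of the command-line route driven from Python's subprocess module. It assumes the Ghostscript executable is on the PATH as gs (on Windows it is usually named differently, e.g. gswin64c). The function names, the output pattern, and the 300 DPI default are illustrative choices, not anything prescribed by Ghostscript.

```python
import subprocess


def ghostscript_command(pdf_path, output_pattern, dpi=300):
    """Build a Ghostscript command that renders every PDF page to a PNG file."""
    return [
        "gs",                # assumed executable name; adjust for your install
        "-dSAFER",           # restrict file access while interpreting the PDF
        "-dBATCH",           # exit when processing is finished
        "-dNOPAUSE",         # do not prompt between pages
        "-sDEVICE=png16m",   # 24-bit colour PNG output device
        "-r%d" % dpi,        # rendering resolution in DPI
        "-sOutputFile=%s" % output_pattern,  # %03d becomes the page number
        pdf_path,
    ]


def pdf_to_png(pdf_path, output_pattern="page_%03d.png", dpi=300):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.check_call(ghostscript_command(pdf_path, output_pattern, dpi))
```

Calling pdf_to_png("letter.pdf") should then write page_001.png, page_002.png, and so on into the current directory.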

别跟我提以往

2020-12-08 03:03

You could call, e.g., pdftoppm from the command line (or via Python's subprocess module) and then convert the resulting PPM files to the desired format with, e.g., ImageMagick (again via subprocess, or through bindings if they exist).
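
A minimal sketch of that two-step pipeline, assuming both pdftoppm (from Poppler or Xpdf) and ImageMagick's convert are on the PATH. The function names and the 150 DPI default are illustrative. Newer ImageMagick installs may expose the tool as magick instead of convert.

```python
import glob
import os
import subprocess


def pdftoppm_command(pdf_path, prefix, dpi=150):
    """Command that renders each page to <prefix>-<page number>.ppm."""
    return ["pdftoppm", "-r", str(dpi), pdf_path, prefix]


def convert_command(ppm_path, extension="png"):
    """Command that converts one PPM file to the target image format."""
    target = os.path.splitext(ppm_path)[0] + "." + extension
    return ["convert", ppm_path, target]  # assumed ImageMagick executable name


def pdf_to_images(pdf_path, prefix, extension="png", dpi=150):
    """Render the PDF with pdftoppm, then convert every resulting PPM."""
    subprocess.check_call(pdftoppm_command(pdf_path, prefix, dpi))
    # pdftoppm may zero-pad the page number, so match any suffix
    for ppm_path in sorted(glob.glob(glob.escape(prefix) + "-*.ppm")):
        subprocess.check_call(convert_command(ppm_path, extension))
```

Poppler's pdftoppm also accepts -png and -jpeg, which can make the ImageMagick step unnecessary.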

生来不讨喜

2020-12-08 03:06

in pdf_to_jpg(pdfPath)

      6     # 'pages' is the number of pages in the pdf
      7     filepath = pdfPath.rsplit('/', 1)[0]
----> 8     filename = pdfPath.rsplit('/', 1)[1]
      9 
     10     #print pdf to jpg using jpg printer

IndexError: list index out of range
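
A likely cause of this IndexError: pdfPath.rsplit('/', 1) returns a one-element list when the path contains no '/', for example a bare file name or a Windows path written with backslashes, so index [1] does not exist. The sketch below shows one way to split such paths. split_pdf_path is a hypothetical helper name, and ntpath is used because it accepts both '/' and '\' as separators on any platform.

```python
import ntpath


def split_pdf_path(pdf_path):
    """Split a PDF path into (directory, filename).

    Works for bare file names and for both '/' and '\\' separators.
    """
    directory, filename = ntpath.split(pdf_path)
    # Fall back to the current directory when there is no directory part
    return directory or ".", filename
```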

耶瑟儿～

2020-12-08 03:09

If the PDFs are truly scanned images, then you shouldn't convert the PDF to an image, you should extract the image from the PDF. Most likely, all of the data in the PDF is essentially one giant image, wrapped in PDF verbosity to make it readable in Acrobat.

You should try the simple expedient of finding the image in the PDF and copying the bytes out: Extracting JPGs from PDFs. The code there is dead simple, and there are probably dozens of reasons it won't work on your PDF files. But if it does, you'll have a quick and painless way to get the image data out of the PDF files.
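
To give an idea of what "copying the bytes out" can look like, here is a simplified sketch. It is not the linked code. It assumes the scans are stored as plain JPEG (DCTDecode) streams, and it simply looks for the JPEG start marker (FF D8) right after each "stream" keyword and the last end marker (FF D9) before the matching "endstream". extract_jpegs is a hypothetical name, and the approach fails on images stored with other filters.

```python
def extract_jpegs(pdf_bytes):
    """Return the raw bytes of every JPEG found inside PDF stream objects."""
    soi = b"\xff\xd8"  # JPEG start-of-image marker
    eoi = b"\xff\xd9"  # JPEG end-of-image marker
    images = []
    pos = 0
    while True:
        stream = pdf_bytes.find(b"stream", pos)
        if stream < 0:
            break
        # A JPEG stream starts with SOI shortly after the "stream" keyword
        start = pdf_bytes.find(soi, stream, stream + 20)
        if start < 0:
            pos = stream + len(b"stream")
            continue
        endstream = pdf_bytes.find(b"endstream", start)
        if endstream < 0:
            break
        # Use the last EOI before "endstream" so embedded thumbnails are kept
        end = pdf_bytes.rfind(eoi, start, endstream)
        if end >= 0:
            images.append(pdf_bytes[start:end + len(eoi)])
        pos = endstream + len(b"endstream")
    return images
```

Each returned item can be written to a .jpg file opened in binary mode.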

滥情空心

2020-12-08 03:18

With Wand there are now excellent ImageMagick bindings for Python that make this a very easy task.

Here is the code necessary for converting a single PDF file into a sequence of PNG images:

from wand.image import Image

input_path = "name_of_file.pdf"
output_name = "name_of_outfile_{index}.png"

# resolution is the density (DPI) used when rasterizing the PDF
with Image(filename=input_path, resolution=300) as source:
    # source.sequence holds one frame per PDF page
    for i, page in enumerate(source.sequence):
        with Image(image=page) as single:
            single.save(filename=output_name.format(index=i))
