Question
Using command-line ghostscript, is it possible to remove duplicate embedded objects (images) in the PDF and replace them with a single instance?
I have a PDF of 200+ pages with a background image and some smaller logos on each page. The file is very large because the very same background image and logo binaries are embedded in each individual page, instead of being embedded once and then referenced on each page. I am not the creator of the PDF, so I cannot solve the problem at its source.
(I do not want to shrink the images or reduce their quality, and I do not want to delete them completely.)
Answer 1:
No, ghostscript (more specifically the pdfwrite device) won't replace duplicate image XObjects or inline images with a single instance; it doesn't test them to see if they are identical.
It would be possible to do so, but it means checking every byte of each image, which can be very expensive on performance, so we don't do it at the moment. If you want to have a go at modifying the source I can give some suggestions on where to start.
FWIW many other objects are tested for duplicates, but not images, simply because of the time taken to read and hash large images.
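To illustrate the hash-and-compare idea described above, here is a minimal Python sketch. It is not ghostscript code and does not parse PDFs: it assumes the image streams have already been extracted into a dictionary keyed by object number, and the function name `deduplicate_streams` is hypothetical. It also shows where the cost comes from, since every byte of every image has to be read to compute the digest.

```python
import hashlib
from typing import Dict, List, Tuple


def deduplicate_streams(streams: Dict[int, bytes]) -> Tuple[Dict[int, int], List[int]]:
    """Map each object id to the id of the first object with identical bytes.

    `streams` is a hypothetical mapping of object number -> raw image bytes.
    Returns (remap, duplicates): `remap` sends every id to its canonical id,
    and `duplicates` lists the ids that could be dropped.
    """
    first_seen: Dict[bytes, int] = {}  # digest -> canonical object id
    remap: Dict[int, int] = {}
    duplicates: List[int] = []

    for obj_id in sorted(streams):
        data = streams[obj_id]
        # Hashing reads every byte, which is the expensive part for large images.
        digest = hashlib.sha256(data).digest()
        canonical = first_seen.get(digest)
        # Confirm with a byte-for-byte comparison so a hash collision
        # can never merge two different images.
        if canonical is not None and streams[canonical] == data:
            remap[obj_id] = canonical
            duplicates.append(obj_id)
        else:
            first_seen.setdefault(digest, obj_id)
            remap[obj_id] = obj_id

    return remap, duplicates


if __name__ == "__main__":
    background = b"background-image-bytes" * 1000
    logo = b"logo-bytes" * 100
    # Three pages that each embed their own copy of the same images.
    pages = {10: background, 11: logo, 20: background, 21: logo, 30: background}
    remap, duplicates = deduplicate_streams(pages)
    print(remap)
    print(duplicates)
```

In a real PDF the references on each page would then be rewritten to point at the canonical object, and the duplicate objects would be dropped.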
Answer 2:
As a supplement to ghostscript, pdfsizeopt does a very good job of eliminating duplicate embedded objects (including background images) in the PDF, and it can be run before or after the file is processed by ghostscript. It is a bit tricky to include in a workflow because of its dependencies, however, and it creates a lot of temporary files. It can be found at https://github.com/pts/pdfsizeopt (formerly https://code.google.com/p/pdfsizeopt/).
My 200+ page document went from 150MB to 40MB just by removing duplicate images.
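For including it in a scripted workflow, a minimal Python sketch might look like the following. It assumes that a `pdfsizeopt` executable is on the PATH and that it takes the input and output paths as two positional arguments, as in the project's README; the helper names `build_command` and `optimize` are hypothetical, and no optional flags are passed.

```python
import shutil
import subprocess
from typing import List, Optional


def build_command(input_pdf: str, output_pdf: str,
                  executable: str = "pdfsizeopt") -> List[str]:
    """Build the command line: executable, input path, output path.

    Assumes the basic positional form `pdfsizeopt input.pdf output.pdf`.
    """
    return [executable, input_pdf, output_pdf]


def optimize(input_pdf: str, output_pdf: str,
             executable: str = "pdfsizeopt") -> Optional[int]:
    """Run the optimizer and return its exit code, or None if it is not installed."""
    if shutil.which(executable) is None:
        return None
    # check=False: the caller inspects the exit code instead of catching an exception.
    return subprocess.run(build_command(input_pdf, output_pdf, executable),
                          check=False).returncode


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    code = optimize("input.pdf", "output.pdf")
    print("pdfsizeopt not found" if code is None else f"exit code {code}")
```

Because pdfsizeopt creates many temporary files, running it from a scratch directory is a reasonable precaution.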
Source: https://stackoverflow.com/questions/27295777/how-to-remove-duplicate-objects-in-pdf-using-ghostscript