How to remove duplicate objects in PDF using ghostscript?

Submitted by 霸气de小男生 on 2021-02-07 08:40:55

Question


Using command-line ghostscript, is it possible to remove duplicate embedded objects (images) in the PDF and replace them with a single instance?

I have a 200+ page PDF with a background image and some smaller logos on each page. The file is very large because the very same background image and logo binaries are embedded in each individual page, instead of being embedded once and then referenced on each page. I am not the creator of the PDF, so I cannot solve the problem at its source.

(I do not want to shrink the images or reduce their quality, and I do not want to delete them completely.)


Answer 1:


No. Ghostscript (more specifically, the pdfwrite device) won't replace duplicate image XObjects or inline images, because it doesn't test them to see if they are identical.

It would be possible to do so, but it means checking every byte of each image, which can be very expensive in terms of performance, so we don't do it at the moment. If you want to have a go at modifying the source, I can give some suggestions on where to start.

FWIW many other objects are tested for duplicates, but not images, simply because of the time taken to read and hash large images.
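
To illustrate the idea of hashing image data to find duplicates, here is a minimal Python sketch. It is not pdfwrite code and does not parse PDF files; the function name `deduplicate_streams` and the sample byte strings are hypothetical stand-ins for decoded image streams. Each stream is hashed, and a match is confirmed with a full byte comparison, which is where the cost for large images comes from.

```python
import hashlib
from collections import defaultdict


def deduplicate_streams(streams):
    """Map each stream index to the index of the first identical stream.

    `streams` is a list of bytes objects standing in for image data
    (hypothetical input; a real tool would extract these from the PDF).
    """
    by_digest = defaultdict(list)  # digest -> indices of distinct streams seen so far
    canonical = {}
    for index, data in enumerate(streams):
        # Hashing reads every byte of the stream once.
        digest = hashlib.sha256(data).digest()
        for candidate in by_digest[digest]:
            # Confirm byte-for-byte so a hash collision cannot merge different images.
            if streams[candidate] == data:
                canonical[index] = candidate
                break
        else:
            # No identical stream seen before: this one becomes the shared instance.
            by_digest[digest].append(index)
            canonical[index] = index
    return canonical


# Sample data: the same background and logo repeated across "pages".
background = b"\x89PNG-background" * 1000
logo = b"\x89PNG-logo" * 100
pages = [background, logo, background, logo, background]

mapping = deduplicate_streams(pages)
print(mapping)
```

In this sketch every repeated stream maps back to its first occurrence, so only two distinct objects would need to be stored and the rest could be written as references to them.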




Answer 2:


As a supplement to Ghostscript, pdfsizeopt does a very good job of eliminating duplicate embedded objects (including background images) in a PDF, and it can be run either before or after the file is processed by Ghostscript. It is a bit tricky to include in a workflow because of its dependencies, however, and it creates a lot of temporary files. It can be found at https://github.com/pts/pdfsizeopt (formerly https://code.google.com/p/pdfsizeopt/).

My 200+ page document went from 150MB to 40MB just by removing duplicate images.
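
If you want to script this step, the following is a minimal Python sketch. It assumes pdfsizeopt is installed, that its executable is on the PATH under the name `pdfsizeopt`, and that it takes the input and output paths as positional arguments; the helper names `build_pdfsizeopt_command` and `optimize` and the file names are hypothetical.

```python
import shutil
import subprocess


def build_pdfsizeopt_command(src, dst, executable="pdfsizeopt"):
    """Return the argument list for a basic run: executable, input, output."""
    return [executable, src, dst]


def optimize(src, dst, executable="pdfsizeopt"):
    """Run pdfsizeopt on `src`, writing the result to `dst`."""
    # Assumption: the executable name matches your installation.
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(build_pdfsizeopt_command(src, dst, executable), check=True)


# Build (but do not run) the command for a hypothetical pair of files.
command = build_pdfsizeopt_command("input.pdf", "output.pdf")
print(command)
```

Calling `optimize("input.pdf", "output.pdf")` would then run the tool. Check the project's README for the options that control image handling before using it on files where image quality matters.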



Source: https://stackoverflow.com/questions/27295777/how-to-remove-duplicate-objects-in-pdf-using-ghostscript
