PDF compressing library/tool

Question

I am working on a project to reduce the size of PDF files by compressing them. Are there any good tools or libraries (.NET) on the market for this? I tried a few tools, such as Onstream Compression, but the results were not satisfactory.

Answer 1:

Some additional (mega)bytes can easily be squeezed out of PDFs. For example, is the well-known "PDF32000_2008.pdf" optimized enough? Its file size is 8,995,189 bytes. It uses object and xref streams, contains (nearly) no images, and everything is packed tight. Or is it?

Look at a page dictionary:

Dict:9 [1 0 R]
.   /Annots Array:3
.   /Contents Stream:3 [2 0 R]
.   /CropBox Array:4
.   /MediaBox Array:4
.   /Parent Dict:4 [124248 0 R]
.   /Resources Dict:4
.   /Rotate 0 (Number)
.   /StructParents 2 (Number)
.   /Type Page (Name)

/Rotate 0 is the default, so why is it there? What is /CropBox there for? It defaults to /MediaBox, and no page in this document has a /CropBox that differs from its /MediaBox. Why is /MediaBox there? It is inheritable and all pages are the same size, so move it to the Pages tree root. There are 756 pages, so redundant (or useless) information is replicated 756 times.
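The page-level cleanup described above can be sketched in a few lines. This is a minimal illustration that models page dictionaries as plain Python dicts, not a real PDF library API; the function name `prune_page_dicts` and the dict layout are assumptions made for the example. Because /Rotate and /CropBox are inheritable, the sketch only removes them when the Pages root does not set them itself, and it assumes a flat page tree with a single root.

```python
def prune_page_dicts(pages_root, pages):
    """Remove redundant keys from page dicts (hypothetical dict-based model).

    pages_root: dict standing in for the Pages tree root.
    pages: list of dicts standing in for the page dictionaries.
    Both are modified in place and returned.
    """
    # Hoist MediaBox to the root when every page carries the same value.
    boxes = [p.get("MediaBox") for p in pages]
    if pages and boxes[0] is not None and all(b == boxes[0] for b in boxes):
        pages_root["MediaBox"] = boxes[0]
        for p in pages:
            del p["MediaBox"]

    for p in pages:
        # Rotate 0 is the default, but only when nothing is inherited.
        if p.get("Rotate") == 0 and "Rotate" not in pages_root:
            del p["Rotate"]
        # CropBox defaults to the (possibly inherited) MediaBox.
        media = p.get("MediaBox", pages_root.get("MediaBox"))
        if "CropBox" in p and "CropBox" not in pages_root and p["CropBox"] == media:
            del p["CropBox"]
    return pages_root, pages


root = {"Type": "Pages"}
pages = [
    {"Type": "Page", "MediaBox": [0, 0, 612, 792],
     "CropBox": [0, 0, 612, 792], "Rotate": 0}
    for _ in range(3)
]
prune_page_dicts(root, pages)
print(root)
print(pages[0])
```

In a real file the same logic would have to walk intermediate Pages nodes as well, since any of them may carry inherited values.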

Look at a typical annotation dictionary:

Dict:6 [3548 0 R]
.   /A Dict:2
.   .   /S URI (Name)
.   .   /URI http://www.iso.org/iso/iso_catalogue/... (String)
.   /Border Array:3
.   .   [0] 0 (Number)
.   .   [1] 0 (Number)
.   .   [2] 0 (Number)
.   /Rect Array:4
.   .   [0] 82.14 (Number)
.   .   [1] 576.8 (Number)
.   .   [2] 137.1 (Number)
.   .   [3] 587.18 (Number)
.   /StructParent 3 (Number)
.   /Subtype Link (Name)
.   /Type Annot (Name)

There are thousands (maybe more than 10,000?) of link annotations in this document. The /Type key is optional, so why is it there? The annotations are invisible rectangles; is placement precision finer than a whole number of points really relevant? Round the coordinates to integers.
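A minimal sketch of that annotation cleanup, again on plain dicts rather than a real PDF library. The function name `slim_link_annotation` is made up for the example. One design choice here is an assumption that goes beyond the text: instead of rounding to the nearest integer, the rectangle is rounded outward (floor for the lower-left corner, ceil for the upper-right) so the clickable area never shrinks.

```python
import math


def slim_link_annotation(annot):
    """Return a copy of a link annotation dict with the optional /Type
    removed and /Rect snapped outward to whole points (hypothetical model)."""
    out = dict(annot)
    if out.get("Subtype") != "Link":
        return out  # leave other annotation types untouched
    out.pop("Type", None)
    x0, y0, x1, y1 = out["Rect"]
    # Rect may list any two opposite corners, so normalize first.
    llx, urx = min(x0, x1), max(x0, x1)
    lly, ury = min(y0, y1), max(y0, y1)
    out["Rect"] = [math.floor(llx), math.floor(lly),
                   math.ceil(urx), math.ceil(ury)]
    return out


link = {
    "Type": "Annot",
    "Subtype": "Link",
    "Border": [0, 0, 0],
    "Rect": [82.14, 576.8, 137.1, 587.18],
    "StructParent": 3,
}
print(slim_link_annotation(link))
```

With the Rect from the dictionary above, this yields [82, 576, 138, 588].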

Look at a fragment of a typical page content stream, a text-showing operator:

[(w)7(ed)-6( b)21(u)1(t shal)-6(l no)-6(t b)-6(e)1( ed)-6(ite)-6(d)1( un)-6(less the typef)23(aces wh)-6(ich )]TJ

Kerning adjustments below some threshold are all but invisible. The threshold is debatable, much like a JPEG quality level: acceptable to some, not to others. I think a very conservative estimate (i.e. one that retains most of the quality), with an effect invisible to the average reader, is that kerning with an absolute value of less than 10 may be omitted. (Care must be taken to preserve justification, of course.) (And I haven't even mentioned that there are files out there with fractional kerning at a precision of 3-6 decimal places, though not in this file.)
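As a rough sketch of the kerning idea, the TJ array can be modelled as a Python list of strings and numbers (the numbers are adjustments in thousandths of a text space unit). The function name `drop_small_kerning` and the list model are assumptions for illustration; real content streams hold byte strings in font-specific encodings. The sketch drops small adjustments and merges the neighbouring strings, and it makes no attempt to compensate for justification.

```python
def drop_small_kerning(tj_items, threshold=10):
    """Drop TJ adjustments with |value| < threshold and merge the
    strings that become adjacent (hypothetical list-based model)."""
    out = []
    for item in tj_items:
        if isinstance(item, str):
            if out and isinstance(out[-1], str):
                out[-1] += item  # previous adjustment was dropped: merge
            else:
                out.append(item)
        elif abs(item) >= threshold:
            out.append(item)  # keep visible adjustments
        # smaller adjustments are silently dropped
    return out


tj = ["w", 7, "ed", -6, " b", 21, "u", 1, "t shal", -6, "l no", -6,
      "t b", -6, "e", 1, " ed", -6, "ite", -6, "d", 1, " un", -6,
      "less the typef", 23, "aces wh", -6, "ich "]
print(drop_small_kerning(tj))
```

Applied to the operator shown above, only the adjustments 21 and 23 survive, which corresponds to [(wed b)21(ut shall not be edited unless the typef)23(aces which )]TJ.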

With the optimizations mentioned above, the file size became 7,982,478 bytes: one megabyte shaved off. And that is certainly not the limit; there may be other, better hidden sources of optimization.

Answer 2:

To add a few more notes to the already good answers: there is a whole range of applications and libraries that will reduce the file size of PDF files. The first question, going along with @Jongware's answer, is whether anything can be done to begin with.

If your PDF files come from everywhere (you have no control over the source), gather a sample of files and determine what your requirements for the resulting PDFs are. If you only want to show them on screen, for example, you have the option to resample images to a much lower resolution (be careful: that no longer necessarily holds for mobile use). If the PDFs are all internal, you have it easier, because you can inspect them and see where you could save.
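To judge whether an image is worth resampling, it helps to compute its effective resolution on the page. The sketch below assumes the default PDF user space of 72 units per inch (no /UserUnit scaling) and that the placed width and height in points are already known; the function names and the 150 dpi target are illustrative assumptions, not recommendations from the answer.

```python
def effective_dpi(pixel_w, pixel_h, placed_w_pt, placed_h_pt):
    """Effective resolution of an image placed on a page, per axis."""
    return (pixel_w / (placed_w_pt / 72.0),
            pixel_h / (placed_h_pt / 72.0))


def downsampled_size(pixel_w, pixel_h, placed_w_pt, placed_h_pt, target_dpi):
    """Pixel size after downsampling to target_dpi; never upsamples."""
    dpi_x, dpi_y = effective_dpi(pixel_w, pixel_h, placed_w_pt, placed_h_pt)
    scale = min(1.0, target_dpi / max(dpi_x, dpi_y))
    return (max(1, round(pixel_w * scale)), max(1, round(pixel_h * scale)))


# A 2400x1800 pixel image placed at 288x216 pt (4x3 inches).
print(effective_dpi(2400, 1800, 288, 216))
print(downsampled_size(2400, 1800, 288, 216, target_dpi=150))
```

Images that are already at or below the target resolution come back unchanged, which avoids recompressing them for no gain.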

Use Adobe Acrobat's "Space Audit" feature. Adobe seems to find satisfaction in hiding this nice tool and moving it around between versions of Acrobat, but in Acrobat Pro XI it can be found by opening a PDF file and then selecting "File > Save as other > Optimized PDF..." (not "Reduced size PDF" as you would think). In the dialog window that shows up there's an "Audit space usage" button that will bring up an information window showing how much space elements in the PDF are using.

Depending on what you find there, there are multiple things you can do. Most have already been mentioned, but here is an incomplete list:

Downsample images.
Change color spaces of images from CMYK to RGB. Be cautious about this as it will a) not provide the space savings you might think (because of compression) and b) might actually be counter-productive if you're unlucky (because of indexing and other neat image tricks).
Remove document and object level metadata (some sample sets of magazine page files I have contain more metadata than actual content).
Remove proprietary application data (Illustrator has a nasty habit of embedding the complete Illustrator document into a PDF file if you're not careful).
Compress object streams and XRef tables if you're sure all readers you're using will be able to handle that.
Use optimal compression IF your target readers will handle that (JBIG2, JPEG2000...)
Optimize the file structure (some bad PDF files don't optimize fonts and other objects and will have multiple copies scattered throughout the file).
Subset all fonts in the document.
Remove ICC profiles if they're not needed.
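The "multiple copies" item in the list above can be illustrated with a small duplicate finder. This is a sketch under simplifying assumptions: streams are modelled as a dict from object number to raw bytes, and two streams count as duplicates when their bytes hash identically. A real optimizer would also have to compare the stream dictionaries (filters, decode parameters and so on) before merging. The name `find_duplicate_streams` is hypothetical.

```python
import hashlib


def find_duplicate_streams(streams):
    """Map each duplicate object number to the first object number
    holding identical bytes (hypothetical dict-based model)."""
    first_seen = {}
    remap = {}
    for num in sorted(streams):
        digest = hashlib.sha256(streams[num]).digest()
        if digest in first_seen:
            remap[num] = first_seen[digest]  # later copy: point to first
        else:
            first_seen[digest] = num
    return remap


streams = {5: b"embedded font data", 9: b"embedded font data", 12: b"image data"}
print(find_duplicate_streams(streams))
```

The resulting map tells a rewriter which indirect references to redirect, after which the duplicate objects become unreferenced and can be dropped.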

If you want to perform these tasks, there are many tools that can help: either libraries that let you implement this yourself, or commercial (and probably other) tools that work through the command line with predefined actions. callas pdfToolbox is one of these tools (I'm connected to this company!), Enfocus PitStop has functionality in this area, and Apago also has functionality here (though I'm not sure off the top of my head whether they have a command-line version).

Answer 3:

@Jongware is right. It's not likely that you will be able to significantly reduce the size of a properly created PDF file.

But many PDFs in the wild can be compressed better, because they do not use the object streams and cross-reference streams introduced in newer versions of the PDF specification. Also, PDFs often contain unused objects that can be safely removed. And yes, images in PDFs can be resized or recompressed to further reduce the size of a PDF.
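Finding unused objects is essentially a reachability problem. The sketch below assumes the file has already been parsed into a dict that maps each object number to the object numbers it references, and that the starting points are the objects named in the trailer (such as /Root and /Info). The function names and the data model are assumptions for illustration, not the API of any particular library.

```python
def reachable_objects(refs, roots):
    """Return the set of object numbers reachable from the roots.

    refs: dict mapping object number -> iterable of referenced numbers.
    roots: iterable of starting object numbers (e.g. from the trailer).
    """
    seen = set()
    stack = list(roots)
    while stack:
        num = stack.pop()
        if num in seen or num not in refs:
            continue  # already visited, or a dangling reference
        seen.add(num)
        stack.extend(refs[num])
    return seen


def unused_objects(refs, roots):
    """Objects present in the file but not reachable from the roots."""
    return set(refs) - reachable_objects(refs, roots)


# Objects 4 and 5 reference each other but nothing reaches them.
refs = {1: [2, 3], 2: [3], 3: [], 4: [5], 5: [4]}
print(unused_objects(refs, roots=[1]))
```

An explicit traversal from the roots is used rather than reference counting, so cycles of otherwise unreferenced objects are still detected.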

If you are fine with commercial solutions, then you might be interested in my answer to a similar question. That answer contains code that shows how to compress PDFs with the Docotic.Pdf library (I am one of the developers of the library).

Answer 4:

There is the PDFBeads Ruby gem.

It works with RubyInstaller 2.3.3 32-bit with DevKit. (Higher versions require the unnecessarily large MSYS2 DevKit.)

On Windows, these programs are needed:

ImageMagick 6.9.x 32-bit dll version with C/C++ development headers (http://ftp.icm.edu.pl/pub/graphics/ImageMagick/binaries or https://yadi.sk/d/4DGwC9Ie3Lkkgo)
jbig2 (http://soft.rubypdf.com/software/windows-version-jbig2-encoder-jbig2-exe or https://yadi.sk/d/4DGwC9Ie3Lkkgo)
libiconv (http://gnuwin32.sourceforge.net/packages/libiconv.htm)

The iconv gem needs to be installed separately with:

gem install iconv -- --with-iconv-include="<path>" --with-iconv-lib="<path>"

(works with simple, short paths)

Source: https://stackoverflow.com/questions/21341130/pdf-compressing-library-tool

Tags

pdf

compression

pdf-conversion