Reducing file sizes of PDFs created using matplotlib by changing font embedding

Question

I'm using matplotlib to produce PDF figures. However, even the simplest figures produce relatively large files; the MWE below produces a file of almost 1 MB. I've learned that the large file size is due to matplotlib fully embedding all the fonts used. Since I'm going to produce quite a few plots and would like to reduce the file sizes, I'm wondering:

Main question:

Is there a way to get matplotlib to embed font subsets instead of the complete fonts? I would also be fine with not including the fonts at all.

Things considered so far:

A vector graphics editor can readily be used to export a PDF including font subsets (as well as not including fonts at all), but having to perform this step for every file (revision) appears unnecessarily tedious.
Similarly, I've read about post-processing PDF files (e.g. using Ghostscript), though the effort seems comparable.
I tried setting 'pdf.fonttype' = 3, which does indeed produce considerably smaller files. However, I'd like to keep the text editable in vector graphics editors, which doesn't seem to work in this case (for example, minus signs are not saved as text).

Since it is easy, though labor-intensive, to produce files with embedded subsets using external software, is it somehow possible to achieve this directly in matplotlib? Any help would be greatly appreciated.

MWE

import matplotlib.pyplot as plt #Setup
import matplotlib as mpl
mpl.rcParams['pdf.fonttype'] = 42
mpl.rcParams['mathtext.fontset'] = 'dejavuserif'
mpl.rc('font',family='Arial',size=12)

fig,ax=plt.subplots(figsize=(2,2)) #Create a figure containing some text
ax.semilogy(1,1,'s',label='Text\n$M_\\mathrm{ath}$')  # '\\m' avoids an invalid escape sequence
ax.legend()
fig.tight_layout()
fig.savefig('test.pdf')

Environment: matplotlib 3.1.1

Answer 1:

The PGF backend helps to reduce a PDF file size dramatically. Just add mpl.use('pgf') to your code. In my environment, this amendment leads to the following:

File size decreases from 817K to 21K (40 times smaller!).
Execution time increases from 1s to 3s.

However, for real figures, the execution time often decreases along with the file size.
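
As a rough illustration, the question's MWE could be switched to the PGF backend as sketched below. This assumes a working LaTeX installation with XeLaTeX on the PATH and Arial installed as a system font; PGF_RC and make_figure are names made up for this example. With this backend the math is typeset by LaTeX rather than by mathtext, so the 'mathtext.fontset' setting from the MWE is left out.

```python
# Settings for the PGF backend; the values are assumptions for this sketch.
PGF_RC = {
    "pgf.texsystem": "xelatex",  # XeLaTeX can load system fonts such as Arial
    "pgf.rcfonts": True,         # take the font families from the rcParams below
    "font.family": "Arial",
    "font.size": 12,
}


def make_figure(path="test.pdf"):
    """Recreate the MWE figure and save it through the PGF backend."""
    # Imported here so the backend is selected before pyplot is loaded.
    import matplotlib as mpl
    mpl.use("pgf")
    mpl.rcParams.update(PGF_RC)
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(2, 2))
    ax.semilogy(1, 1, "s", label="Text\n$M_\\mathrm{ath}$")
    ax.legend()
    fig.tight_layout()
    fig.savefig(path)  # the .pdf extension selects PDF output
    plt.close(fig)
```

Calling make_figure() should write test.pdf to the current directory; the first run may be slow because LaTeX has to start up.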

The reduction in PDF size comes from embedding font subsets instead of complete fonts, as the "sub" column of the pdffonts output shows:

$ pdffonts pdf_backend.pdf
name                         type              emb sub uni prob object ID
---------------------------- ----------------- --- --- --- ---- ---------
ArialMT                      CID TrueType      yes no  yes          14  0
DejaVuSerif-Italic           CID TrueType      yes no  yes          23  0
DejaVuSerif                  CID TrueType      yes no  yes          32  0

$ pdffonts pgf_backend.pdf
name                         type              emb sub uni prob object ID
---------------------------- ----------------- --- --- --- ---- ---------
KECVVY+ArialMT               CID TrueType      yes yes yes           7  0
EFAAMX+CMR12                 Type 1C           yes yes yes           8  0
EHYQVR+CMSY8                 Type 1C           yes yes yes           9  0
UVNOSL+CMR8                  Type 1C           yes yes yes          10  0
FDPQQI+CMMI12                Type 1C           yes yes yes          11  0
DGIYWD+DejaVuSerif           CID TrueType      yes yes yes          13  0
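
When many figures need checking, the pdffonts table can be read from Python. The following is a minimal sketch, assuming the column layout shown above and font names without spaces; parse_pdffonts and pdffonts_output are helpers invented for this example, not part of any library.

```python
import re
import subprocess

# name, type (may contain spaces), then the emb / sub / uni flags.
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<type>.+?)\s+"
    r"(?P<emb>yes|no)\s+(?P<sub>yes|no)\s+(?P<uni>yes|no)\b"
)


def parse_pdffonts(text):
    """Turn the table printed by pdffonts into a list of dicts."""
    fonts = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:  # the header and the dashed rule do not match and are skipped
            fonts.append({
                "name": m["name"],
                "type": m["type"],
                "embedded": m["emb"] == "yes",
                "subset": m["sub"] == "yes",
            })
    return fonts


def pdffonts_output(pdf_path):
    """Run pdffonts (assumed to be on PATH) and return its stdout."""
    result = subprocess.run(["pdffonts", pdf_path],
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, [f["name"] for f in parse_pdffonts(pdffonts_output("test.pdf")) if not f["subset"]] would list the fonts that are embedded in full.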

Another option is to produce an EPS file (using the PostScript backend) and convert it to PDF, e.g., with epstopdf (which uses the Ghostscript interpreter). This approach reduces the PDF file to 9K. Note, however, that the PS backend does not support transparency.
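
The EPS route can be scripted too. A minimal sketch, assuming the epstopdf command (shipped with TeX distributions) and Ghostscript are installed and on the PATH; epstopdf_command and save_via_eps are names invented for this example.

```python
import os
import subprocess


def epstopdf_command(eps_path, pdf_path=None):
    """Build the epstopdf call that converts eps_path to a PDF."""
    if pdf_path is None:
        pdf_path = os.path.splitext(eps_path)[0] + ".pdf"
    return ["epstopdf", "--outfile=" + pdf_path, eps_path]


def save_via_eps(fig, pdf_path):
    """Save a matplotlib figure as EPS, then convert it to PDF."""
    eps_path = os.path.splitext(pdf_path)[0] + ".eps"
    fig.savefig(eps_path)  # the .eps extension selects EPS output
    subprocess.run(epstopdf_command(eps_path, pdf_path), check=True)
    os.remove(eps_path)    # drop the intermediate EPS file
```

With the figure from the MWE, save_via_eps(fig, 'test.pdf') would take the place of fig.savefig('test.pdf').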

Answer 2:

Leaving this here in case anybody else is looking for something similar: in the end, I opted for Ghostscript. Because of the extra step it is not exactly what I was looking for, but at least it can be automated:

import os
import subprocess

def gs_opt(filename):
    """Rewrite a PDF in place with Ghostscript, embedding font subsets."""
    # splitext also handles paths that contain dots, e.g. './test.pdf'
    filenameTmp = os.path.splitext(filename)[0] + '_tmp.pdf'
    gs = ['gswin64',                        # Ghostscript executable (64-bit Windows)
          '-sDEVICE=pdfwrite',
          '-dEmbedAllFonts=false',
          '-dSubsetFonts=true',             # Create font subsets (default)
          '-dPDFSETTINGS=/prepress',        # Quality preset (affects image resolution, among other things)
          '-dDetectDuplicateImages=true',   # Embeds images used multiple times only once
          '-dCompressFonts=true',           # Compress fonts in the output (default)
          '-dNOPAUSE',                      # No pause after each page
          '-dQUIET',                        # Suppress output
          '-dBATCH',                        # Automatically exit
          '-sOutputFile='+filenameTmp,      # Save to temporary output
          filename]                         # Input file

    subprocess.run(gs, check=True)          # Create temporary file; raise if Ghostscript fails
    os.replace(filenameTmp, filename)       # Overwrite the input file with the temporary file

And then calling

filename = 'test.pdf'
plt.savefig(filename)
gs_opt(filename)

This will save the figure as test.pdf, use Ghostscript to create a temporary, optimized test_tmp.pdf, delete the initial file and rename the optimized file to test.pdf.
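
The function above hard-codes the Windows executable name gswin64. A hedged sketch of a platform-neutral lookup follows; it assumes the usual Ghostscript console executable names (gs on Linux and macOS, gswin64c or gswin32c on Windows) and uses a reduced set of options. find_ghostscript and gs_command are hypothetical helpers, not part of the original answer.

```python
import shutil


def find_ghostscript():
    """Return the first Ghostscript executable found on PATH, or None."""
    for name in ("gs", "gswin64c", "gswin32c"):
        path = shutil.which(name)
        if path:
            return path
    return None


def gs_command(executable, src, dst):
    """Build a Ghostscript call that rewrites src to dst with font subsets."""
    return [executable,
            "-sDEVICE=pdfwrite",
            "-dSubsetFonts=true",
            "-dNOPAUSE", "-dQUIET", "-dBATCH",
            "-sOutputFile=" + dst,  # output file
            src]                    # input file
```

The resulting list can be passed to subprocess.run(..., check=True) in the same way as the gs list in gs_opt.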

Compared to exporting the file with a vector graphics editor, the PDF created by Ghostscript is still a few times larger (typically 4-5 times). Still, it reduces the file size to between 1/5 and 1/10 of the original. It's something.

Source: https://stackoverflow.com/questions/60076026/reducing-file-sizes-of-pdfs-created-using-matplotlib-by-changing-font-embedding

Tags

python

matplotlib

pdf

fonts

font-embedding