PDF Compression Explained: What Actually Happens to Your File
A technical breakdown of how PDF compression works — from font subsetting to image recompression. Understand what changes and what stays the same.
When you compress a PDF, the file gets smaller. But what actually changes inside the file? Is anything lost? And why do some PDFs compress to 20% of their original size while others barely shrink at all?
This is the technical explanation.
The Anatomy of a PDF
A PDF file is not a single blob of data. It is a structured container with distinct sections:
Objects — Every piece of content (text, image, font, annotation) is stored as a numbered object. A 10-page document might have 200-500 objects.
Cross-reference table (xref) — An index that maps object numbers to their byte positions in the file. This is how a PDF reader jumps directly to page 7 without reading pages 1-6.
Streams — Binary data (images, embedded files, compressed content) stored inside objects. Streams can be individually compressed using filters like FlateDecode (zlib) or DCTDecode (JPEG).
Metadata — Document properties (author, creation date, producer software, XMP data). Often 5-50KB of information you never see.
Understanding this structure explains why compression works differently on different PDFs.
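To make the object and xref relationship concrete, here is a minimal Python sketch. It assembles a tiny PDF-style byte string and then looks up an object's byte offset through the xref table, the way a reader would. The helper names `build_minimal_pdf` and `object_offset` are made up for this illustration, and the sketch assumes a classic single-section xref table rather than the compressed xref streams that newer PDFs may use.

```python
import re

def build_minimal_pdf(objects):
    """Assemble object bodies into a PDF-style byte string with an xref table."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))          # byte position where this object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"       # entry 0 is the head of the free list
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # fixed-width, 20 bytes per entry
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)

def object_offset(pdf, number):
    """Find an object's byte offset via startxref and the xref table."""
    xref_pos = int(re.search(rb"startxref\s+(\d+)", pdf).group(1))
    lines = pdf[xref_pos:].split(b"\n")
    # lines[0] is "xref", lines[1] is the "0 N" header, entries follow.
    return int(lines[2 + number][:10])

pdf = build_minimal_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
])
off = object_offset(pdf, 2)
print(pdf[off:off + 7])   # b'2 0 obj'
```

The lookup never scans the objects themselves. It reads one number near the end of the file and one fixed-width table entry, which is why random access stays cheap even in large documents.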
The Six Compression Techniques
1. Object Deduplication
PDFs created by certain tools (especially those produced by merging multiple files) contain duplicate objects. Two identical fonts, three copies of the same logo, redundant color profiles.
Deduplication finds these duplicates and replaces them with references to a single copy. This is lossless — nothing visible changes.
Typical savings: 5-15% for merged documents, near zero for single-source PDFs.
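The idea can be sketched with content hashing. The snippet below is an illustration only: it assumes objects are available as a plain `{object_number: raw_bytes}` mapping, which is a simplification, and the `deduplicate` helper is hypothetical rather than taken from any particular tool.

```python
import hashlib

def deduplicate(objects):
    """Map duplicate object bodies onto a single surviving object number.

    Returns (kept, remap): `kept` holds one copy per unique body, and
    `remap` sends every original number to the number that was kept.
    """
    first_seen = {}   # content hash -> object number that was kept
    kept, remap = {}, {}
    for number in sorted(objects):
        digest = hashlib.sha256(objects[number]).digest()
        if digest not in first_seen:
            first_seen[digest] = number
            kept[number] = objects[number]
        remap[number] = first_seen[digest]
    return kept, remap

objects = {
    1: b"logo-image-bytes",
    2: b"font-program-bytes",
    3: b"logo-image-bytes",   # same logo brought in by a merged file
}
kept, remap = deduplicate(objects)
print(sorted(kept))   # [1, 2]
print(remap)          # {1: 1, 2: 2, 3: 1}
```

A real optimizer would also have to rewrite every indirect reference (for example `3 0 R` becoming `1 0 R`) using the remap table, and rebuild the xref afterwards.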
2. Font Subsetting
A font file contains glyphs for every character the font supports — often 500-2000 characters. If your document only uses 80 characters (A-Z, a-z, 0-9, basic punctuation), the remaining glyphs are dead weight.
Subsetting strips unused glyphs from embedded fonts. The text renders identically because every character actually used is preserved.
Typical savings: 2-8% depending on the number and size of embedded fonts.
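As a toy model of subsetting, the sketch below treats a font as a dictionary from character to outline bytes and keeps only the entries a document uses. This is a deliberate simplification: real fonts index glyphs by ID, use character maps, and may have glyphs that depend on other glyphs. The `subset_glyphs` helper and the 200-byte outline size are assumptions made for the example.

```python
def subset_glyphs(glyph_table, text):
    """Keep only the glyphs for characters that appear in `text`."""
    used = set(text)
    return {ch: outline for ch, outline in glyph_table.items() if ch in used}

# Toy font: one 200-byte "outline" per character for 1000 code points.
glyph_table = {chr(cp): bytes(200) for cp in range(32, 1032)}
text = "Invoice 2024: total due 1,250.00"

subset = subset_glyphs(glyph_table, text)
full_size = sum(len(g) for g in glyph_table.values())
subset_size = sum(len(g) for g in subset.values())

print(len(glyph_table), "->", len(subset), "glyphs")
print(full_size, "->", subset_size, "bytes")
```

Every character in the text still has its outline, so rendering is unchanged; only glyphs the document never draws are gone.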
3. Stream Recompression
Some PDF generators compress their content streams with fast, low-effort settings, or not at all. Recompressing these streams at a higher zlib compression level can reduce their size without changing the decoded content.
This is fully lossless. The decompressed data is bit-for-bit identical.
Typical savings: 1-5%.
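The lossless property is easy to check with Python's standard `zlib` module. In this sketch the sample content stream is invented, and the exact compressed sizes will vary with the zlib build, so no specific numbers are claimed.

```python
import zlib

# Repetitive drawing operators, similar in spirit to a page content stream.
content = b"BT /F1 12 Tf 72 720 Td (Hello, world) Tj ET\n" * 500

weak = zlib.compress(content, 1)    # fastest, least thorough level
strong = zlib.compress(content, 9)  # slowest, most thorough level

# Lossless check: both versions decode to exactly the same bytes.
assert zlib.decompress(weak) == zlib.decompress(strong) == content

print("raw:", len(content), "level 1:", len(weak), "level 9:", len(strong))
```

Whatever the level, a PDF reader applying FlateDecode gets back identical bytes, which is why this step carries no quality risk.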
4. Image Recompression
This is where the big savings come from — and where quality tradeoffs happen.
Images inside a PDF are stored as embedded raster data, usually JPEG-encoded (DCTDecode) or as losslessly Flate-compressed pixel data, which is how PNG sources typically end up. Recompression re-encodes these images at a lower quality setting or reduced resolution.
What changes: Pixel-level detail in images. Fine gradients may show banding. Text in images stays readable but may look slightly softer.
What does not change: Vector text (rendered from fonts), line art, shapes, form fields.
Typical savings: 30-70% of total file size, since images are usually the largest objects.
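Resolution reduction is the simpler half of this to illustrate. The sketch below halves a small grayscale image by averaging each 2x2 block, using plain Python lists so that it needs no imaging library. Production tools generally use better resampling filters, and the `downsample_2x` helper is only an illustration.

```python
def downsample_2x(pixels):
    """Halve width and height by averaging each 2x2 block of grayscale pixels."""
    height, width = len(pixels), len(pixels[0])
    return [
        [
            (pixels[y][x] + pixels[y][x + 1]
             + pixels[y + 1][x] + pixels[y + 1][x + 1]) // 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]

image = [
    [0,   0, 255, 255],
    [0,   0, 255, 255],
    [0, 100, 255, 255],   # a thin one-pixel-wide detail in column 1
    [0, 100, 255, 255],
]
small = downsample_2x(image)
print(small)   # [[0, 255], [50, 255]]
```

Sixteen samples become four, and the one-pixel-wide detail is blended into its neighbour. That is the tradeoff in miniature: a quarter of the data, with the finest detail softened.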
5. Metadata Stripping
PDF metadata includes:
- Author name and email
- Creation and modification dates
- Producer software (e.g., "Adobe InDesign 2024")
- XMP metadata blocks (can be 10-100KB)
- Document ID and instance ID
- Custom properties
Removing metadata is lossless in terms of visual output. The document looks and prints identically.
Typical savings: 1-3% for normal documents, up to 10% for metadata-heavy files.
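To get a rough sense of how much space XMP takes in a given file, one option is to look for XMP packet markers in the raw bytes. This sketch assumes the packet is stored as plain, unfiltered text, and the sample bytes are synthetic. The `xmp_bytes` helper only measures; it does not strip anything, because deleting bytes in place would invalidate the xref offsets.

```python
import re

# An XMP packet runs from its "<?xpacket begin=" header to the closing
# "<?xpacket end=...?>" processing instruction.
XMP_PACKET = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=.*?\?>", re.DOTALL)

def xmp_bytes(pdf_bytes):
    """Total size of all XMP packets found as plain text in the file."""
    return sum(len(m.group(0)) for m in XMP_PACKET.finditer(pdf_bytes))

# Synthetic packet with whitespace padding, which XMP writers often
# include to leave room for in-place edits.
packet = (
    b'<?xpacket begin="" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b"<x:xmpmeta xmlns:x='adobe:ns:meta/'>" + b" " * 2048 + b"</x:xmpmeta>"
    b'<?xpacket end="w"?>'
)
sample = b"%PDF-1.7\n(page objects here)\n" + packet + b"\n%%EOF\n"

print(xmp_bytes(sample), "of", len(sample), "bytes are XMP")
```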
6. Linearization Removal
Linearized PDFs ("fast web view") contain extra data structures that allow page-at-a-time loading over HTTP. This adds overhead. If the PDF will be viewed locally (not streamed from a web server), removing linearization saves space.
Typical savings: 1-2%.
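A quick heuristic for spotting a linearized file is to look for the `/Linearized` key near the start, since the linearization parameter dictionary is expected within the first 1024 bytes. The sketch below uses made-up sample bytes with an abbreviated dictionary, and `looks_linearized` is a hypothetical helper, not a validator.

```python
def looks_linearized(pdf_bytes):
    """Heuristic: the linearization dictionary sits in the first 1024 bytes."""
    return b"/Linearized" in pdf_bytes[:1024]

# Abbreviated, illustrative headers (not complete, valid PDFs).
linearized = b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L 12345 /O 4 /N 1 >>\nendobj\n"
plain = b"%PDF-1.7\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

print(looks_linearized(linearized), looks_linearized(plain))   # True False
```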
Why Some PDFs Barely Compress
If your PDF shrinks by only 5-10% at maximum compression, it is likely because:
- Already compressed images: JPEG images inside the PDF are already compressed. Re-encoding them gives diminishing returns and introduces generational quality loss.
- Text-only content: Pure text PDFs with subsetted fonts are already small. There is little to remove.
- Pre-optimized output: Some PDF generators (Adobe Acrobat's "Reduce File Size" feature) already apply most of these techniques.
Why Some PDFs Compress Dramatically
A 50MB PDF that compresses to 5MB likely has:
- Uncompressed or losslessly compressed images: Some tools embed raw or Flate-compressed bitmaps (for example, from BMP or PNG sources) where JPEG would be appropriate.
- Full font embeddings: The complete font file (500KB-2MB per font) when only a subset is needed.
- Duplicate resources from merging: Each merged PDF brought its own copy of common fonts and resources.
- Excessive resolution: Images embedded at 600 DPI when 150 DPI is sufficient for screen viewing.
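The resolution point is worth quantifying, because pixel count grows with the square of DPI. The following sketch computes the uncompressed size of a full-page RGB scan on US Letter paper; the page size, the 3 bytes per pixel, and the `raw_image_bytes` helper are assumptions chosen for the example.

```python
def raw_image_bytes(width_in, height_in, dpi, bytes_per_pixel=3):
    """Uncompressed size of a full-page scan at a given resolution."""
    return round(width_in * dpi) * round(height_in * dpi) * bytes_per_pixel

for dpi in (600, 300, 150):
    size = raw_image_bytes(8.5, 11, dpi)
    print(dpi, "DPI:", round(size / 1e6, 1), "MB uncompressed")
# 600 DPI: 101.0 MB uncompressed
# 300 DPI: 25.2 MB uncompressed
# 150 DPI: 6.3 MB uncompressed
```

Dropping from 600 to 150 DPI divides the pixel data by 16 before any JPEG encoding is applied, which goes a long way toward explaining tenfold reductions on scan-heavy files.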
The Quality Spectrum
Think of compression quality as a slider between two extremes:
| Compression | Image Quality | File Size (% of original) | Use Case |
|-------------|---------------|---------------------------|----------|
| None | Original | 100% | Archival |
| Minimal | Indistinguishable | 70-90% | Legal documents |
| Balanced | Excellent on screen | 40-60% | Email, sharing |
| Maximum | Good on screen | 20-40% | Drafts, reference |
Text rendered from fonts is never affected by compression. Only raster images (photos, scans, embedded graphics) are subject to quality reduction.
Try It Yourself
The best way to understand compression is to see it on your own files. Compress a PDF with Naqia — try all three levels on the same file and compare the results. The tool runs in your browser, so your file stays private.