Chukwuma Zikora
Software Engineer. RPA and DevOps Enthusiast
Techniques for Compressing PDF Files
Learn how to compress PDF files using different techniques with Node.js, including Ghostscript and QPDF implementations.
Have you ever been tasked with reducing the size of a PDF file, or tried uploading a PDF file to a website that has a size limit? Here I will show you two ways to reduce the size of a PDF file using Node.js.
NOTE: This tutorial assumes you have some knowledge of Docker and Node.js.
Introduction
A PDF is a file format used to present a document (including text and images) in a manner independent of the software application used to view it. The fact that images can be embedded in a PDF document is the main reason its size can become very large.
Many people scan receipts and other documents to PDF, and without OCR processing the pages are stored as images rather than text, which increases the overall size of the document.
To help optimize the document, we will use Ghostscript and qpdf to produce a grey-scaled version of the document with a resolution of 300 dpi.
Environment Setup
To keep our application contained, we will be using Docker to package it.
First, we create a project folder and create these two files within it: Dockerfile and index.js. You should have something similar to this structure.
We will be using a basic Node.js Alpine image for this.
Copying your project files and setting the command to run when the container starts are the two most important steps here.
Write the script
Since we will be using command-line utilities, we will use Node.js' built-in child_process module to execute our commands.
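As a minimal sketch of how child_process can run an external program, the snippet below wraps execFileSync in a small helper. The helper name runCommand is hypothetical and not taken from the original script, and the original code may have used exec or spawn instead.

```javascript
const { execFileSync } = require('child_process');

// Hypothetical helper: run a command-line tool and return its trimmed stdout.
// Arguments are passed as an array, so no shell quoting is involved.
function runCommand(binary, args) {
  return execFileSync(binary, args, { encoding: 'utf8' }).trim();
}

// Example: ask the current Node.js binary to evaluate a one-line script.
const output = runCommand(process.execPath, [
  '-e',
  'console.log("hello from a child process")',
]);
console.log(output);
```

Passing the arguments as an array rather than one command string avoids problems with spaces in file names, which are common in scanned documents.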
Solution 1 - Ghostscript
First we will need to modify our Dockerfile to add the ghostscript binary by adding a new line.
Your Dockerfile should now look similar to this.
To convert our files, here is the command we will run against the Ghostscript binary.
Command Breakdown
-sDEVICE=pdfwrite selects which output device Ghostscript should use. We are compressing a PDF file, so we will be using pdfwrite. See this page for other options.
-dCompatibilityLevel=1.5 generates a PDF version 1.5. Here's a list of all PDF versions.
-dPDFSETTINGS=/printer sets the image quality for printers. For additional compression choose /screen. Printer has a dpi of 300, while screen has 72.
-dBATCH and -dNOPAUSE make Ghostscript process the input file(s) without interaction and exit when completed.
-dQUIET mutes routine information comments on standard output.
-sOutputFile=output.pdf sets the path to store the compressed file.
input.pdf is the path of the file to process.
You can read the docs to see other available options. For our use case, we will be using the options listed above.
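The options above can be assembled in Node.js roughly as follows. This is a sketch, not the original script: the function names are hypothetical, and it assumes the Ghostscript binary is installed in the container under the name gs.

```javascript
const { execFileSync } = require('child_process');

// Build the Ghostscript argument list from the options described above.
// inputPath and outputPath are placeholders supplied by the caller.
function ghostscriptArgs(inputPath, outputPath) {
  return [
    '-sDEVICE=pdfwrite',
    '-dCompatibilityLevel=1.5',
    '-dPDFSETTINGS=/printer',
    '-dNOPAUSE',
    '-dQUIET',
    '-dBATCH',
    `-sOutputFile=${outputPath}`,
    inputPath,
  ];
}

// Run Ghostscript. Assumption: the binary is on the PATH and is called `gs`.
function compressWithGhostscript(inputPath, outputPath) {
  execFileSync('gs', ghostscriptArgs(inputPath, outputPath), { stdio: 'inherit' });
}
```

Calling compressWithGhostscript('input.pdf', 'output.pdf') would then run the same command as described in the breakdown. Swapping /printer for /screen in the argument list trades image quality for a smaller file.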
After execution, the output file name will include the word compressed and a date string, to differentiate the compressed file from the original file.
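One way to derive such an output name is sketched below. The exact naming pattern used by the original script is not shown, so the name-compressed-date layout and the helper name compressedName are assumptions made for illustration.

```javascript
const path = require('path');

// Hypothetical helper: derive the output name from the input name by
// appending the word "compressed" and a YYYY-MM-DD date string.
// The separator and date format are assumptions, not the original pattern.
function compressedName(inputPath, date = new Date()) {
  const { dir, name, ext } = path.parse(inputPath);
  const stamp = date.toISOString().slice(0, 10); // date part of the ISO string
  return path.join(dir, `${name}-compressed-${stamp}${ext}`);
}

console.log(compressedName('receipt.pdf'));
```

Because the original file name is left untouched, the script can be run repeatedly without overwriting the source document.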
Your complete code should look like this.
Solution 2 - QPDF
Similar to our Ghostscript setup, we will need to add qpdf to our Dockerfile.
Your Dockerfile should now look similar to this.
To convert our files, here is the command we will run against the qpdf binary.
As you can see from the qpdf options, we are explicitly asking the library to optimize the images in our PDF file. Next, we update our code to include the qpdf command.
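A rough equivalent for qpdf is sketched below. The original command is not shown here, so the use of the --optimize-images flag is an assumption based on the description above, and the function names are hypothetical. It also assumes the binary is installed under the name qpdf.

```javascript
const { execFileSync } = require('child_process');

// Build the qpdf argument list. Assumption: --optimize-images is the option
// the text refers to. qpdf takes the input file first, then the output file.
function qpdfArgs(inputPath, outputPath) {
  return ['--optimize-images', inputPath, outputPath];
}

// Run qpdf. Assumption: the binary is on the PATH and is called `qpdf`.
function compressWithQpdf(inputPath, outputPath) {
  execFileSync('qpdf', qpdfArgs(inputPath, outputPath), { stdio: 'inherit' });
}
```

Note the argument order: unlike Ghostscript, which names the output through -sOutputFile, qpdf expects the input path followed by the output path as positional arguments.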
Your complete code should look like this.
Test the code
First, let us build the image to bundle our code together with our chosen binary.
To run the container, I will mount the /home/node/application directory to a directory on my local machine that has the files I would like to compress, so the code can reach them and also write the compressed files to the same directory.
Conclusion
The gains made on the compression depend mostly on how many uncompressed or unoptimized images are present in the document. You can test both solutions and tweak their options until you find a combination that gives you the best result.
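To compare the two solutions on your own files, a small helper like the one below can report how much smaller each output is. It is only a sketch, and the helper name sizeReduction is hypothetical.

```javascript
const fs = require('fs');

// Hypothetical helper: percentage by which the compressed file is smaller
// than the original. A negative result means the output grew.
function sizeReduction(originalPath, compressedPath) {
  const before = fs.statSync(originalPath).size;
  const after = fs.statSync(compressedPath).size;
  return ((before - after) / before) * 100;
}
```

Running it against the Ghostscript output and the qpdf output for the same input gives a direct comparison of the two approaches for that document.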