“Normally, I hand craft my images using vim.”

The above is a quote from Sam Ruby’s blog.  This deliciously innocent pedantry made me choke on my coffee in laughter.

The long road from HTML to PDF

HTML and PDF are the two most common formats on the web.  With good reason: HTML and friends give you portability on the screen, while PDF gives you portability on paper.  With a few exceptions, they’re two ways of displaying the same content, optimized for different media.  So many developers will want to present their end-users with both ways of viewing their content: are you going to read this on screen, or on paper?  What’s more, developers will want the brand image of their site carried over to both media.

How do we achieve this?  The solution will inevitably involve conversion from a reference format, and I can see three possible architectures here:

  1. Develop in HTML; convert this to PDF when needed.
  2. Develop in PDF; convert this to HTML when needed.
  3. Develop in format X; convert this by turns to HTML and PDF when needed.

Let’s dismiss architecture no. 2 straight away; no sane person develops in PDF.  Architecture no. 3 has some successful implementations (I’m thinking of reST here), but it has at least one problem: it makes my second requirement, bringing the brand image to both formats, difficult.  Why?  Because the “brand image” will be specified somewhere external to the content itself, in a format specific to format X, and this will then have to be transformed into CSS on the HTML side, and into whatever else is needed on the way to PDF (e.g. through LaTeX, of which I am ignorant).  This is presumably possible, but hard.  The only easy (and sensible) way architecture no. 3 can get around this is to go by the [Format X -> HTML -> PDF] route, but in that case we’re back to implementing architecture no. 1.

So, my iron logic dictates that the way of distributing content as both HTML and branding-aware PDF is to use an HTML -> PDF converter.

Now, searching for this will bring you examples of software making valiant attempts to implement all of HTML and CSS, including the CSS2 and 3 additions for paged media.  I know of three converters of this type:

  • Pisa, or XHTML2PDF.  Written in Python, under GPL or commercial license.
  • dompdf, “a (mostly) CSS 2.1-compliant HTML to PDF converter.”  Written in PHP, licensed under LGPL.
  • Prince XML.  Closed source and bloody expensive, though a free watermarking version is available.

I’m going to dismiss Pisa: my (quickly aborted) experience with it was awful (every block element was placed in a visible box, and many lines of CSS were misinterpreted).  As for dompdf, I’ve actually used it to good effect when writing in PHP.  I have a couple of problems with it: I’m not a big fan of PHP (it is not well suited to my ideal usage of a converter on the command line), and there are some things it couldn’t handle well, in particular pagination (with tables), the crucial ability of anything converting to a page-oriented format.

The third, Prince XML, is pretty good at what it does.  However, the free version I’ve tried out places its logo on the first page.  I don’t want this blemish on my lovingly crafted documents; nor do I want to waste my precious printer ink.  I considered hacking away the watermark: firstly by inserting an extra first page via CSS and then slicing off the first page of the PDF, but this fudges things like page numbering; secondly by programmatically removing the watermark, but this came to nothing too (I couldn’t figure out how to do it).  In any case, I don’t feel comfortable violating the license like that, and it’s an ugly solution.  Also, did I mention that Prince XML is bloody expensive?

It took me quite a while to realise what is now an obvious point: all these programs are reinventing the wheel.  They are literally implementing their own browser and its ability to print to PDF.  Most of you can do all that right now: File > Print > Print to File, depending on your browser.  The happy realisation is that the bulk of the work needed for our task is right here, hidden away in our open-source browsers.

But let me re-cap exactly what I want to be able to do:

eegg@pc:~$ ls
foo.htm
eegg@pc:~$ cat foo.htm | html2pdf > foo.pdf
eegg@pc:~$ ls
foo.htm foo.pdf

This CLI program, html2pdf, is not so difficult to create, considering that it just has to harness the power of a browser engine.  Gecko, the Firefox engine, uses the cairo graphics library, which can produce PDFs.  It seems logical, therefore, that it could easily be harnessed (here’s one long request for that).  One attempt has been made at a plugin that allows you to order a PDF copy of the page when launching Firefox: cmdlnprint.  It works, with some fairly big shortcomings:

  • You have to have an installation of Firefox.
  • Despite being ordered from the command line, it launches a window before doing its thing.  When I first used this for a batch job, I had hundreds of windows open and my computer ground to a halt.
  • It’ll interact with your ordinary Firefox profile — if you set your paper size to A5 for a print job in your lunch hour, your later batch jobs will use the same setting.  Yes, you can create a separate profile for the conversion job, but my experience of this hasn’t been good.
  • Firefox doesn’t seem to have very good support for CSS when it comes to printing.  It doesn’t seem to play well with paper sizes, print margins, and other things.

So, what engines will work well?  Let’s have a look at some cross-browser tests for HTML5 and CSS3.  Flying ahead is Safari, the Mac browser.  Before you lament, “I’m not on a Mac!”, these results are those of WebKit, the engine behind Safari, and WebKit is open source and free.  So where is WebKit outside of Safari?  WebKit requires a widget toolkit in order to run, and the two toolkits in my world are Qt and GTK.  Both have projects for integrating WebKit: QtWebKit and WebKitGTK respectively.

Herein lies the answer.  I stumbled across a project which seemingly has next to no publicity.  Without further ado, it’s wkhtmltopdf.  That’s WebKit HTML to PDF.  What’s more, if you’re on a Debian system, it’s in the repository.

How good is it?  I tested it against the HTML+CSS in an article at A List Apart about using Prince XML.  Compared to the output that Prince XML created from it, it’s fairly impressive.
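
For the curious, here is a rough sketch of how the html2pdf command from my wish-list above might be wrapped around wkhtmltopdf in Python.  Treat it as an illustration under assumptions rather than a finished tool: it assumes a wkhtmltopdf binary on your PATH that accepts "-" for both its input and output arguments (meaning stdin and stdout), and the function names build_command and html_to_pdf are my own inventions, not part of any library.

```python
"""html2pdf sketch: pipe HTML through wkhtmltopdf and get PDF bytes back."""
import subprocess
import sys


def build_command(binary="wkhtmltopdf", page_size=None):
    """Return the argument list for one conversion.

    Assumption: the installed wkhtmltopdf treats "-" as stdin for the
    input argument and as stdout for the output argument.
    """
    command = [binary, "--quiet"]
    if page_size is not None:
        # e.g. "A4" or "Letter"; left at the tool's default when omitted.
        command += ["--page-size", page_size]
    return command + ["-", "-"]


def html_to_pdf(html, page_size=None, run=subprocess.run):
    """Convert HTML bytes to PDF bytes.

    `run` is injectable so the function can be exercised without the
    real binary; by default it is subprocess.run.
    """
    result = run(
        build_command(page_size=page_size),
        input=html,              # HTML goes to the converter's stdin
        stdout=subprocess.PIPE,  # PDF comes back on its stdout
        check=True,              # raise if the converter exits non-zero
    )
    return result.stdout


def main():
    """Read HTML from stdin and write the PDF to stdout."""
    sys.stdout.buffer.write(html_to_pdf(sys.stdin.buffer.read()))
```

Saved as an executable script called html2pdf, with a call to main() added at the bottom, this should allow the `cat foo.htm | html2pdf > foo.pdf` invocation shown earlier.  Passing the page size explicitly on each run, rather than relying on stored settings, is my way of sidestepping the shared-profile problem I hit with cmdlnprint.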