The long road from HTML to PDF


HTML and PDF are the two most common document formats on the web, and with good reason: HTML and friends give you portability on screen, while PDF gives you portability on paper.  With a few exceptions, they’re two ways of displaying the same content, optimized for different media.  So many developers will want to offer their end users both ways of viewing the content: are you going to read this on screen, or on paper?  What’s more, developers will want their site’s brand image carried over to both media.

How do we achieve this?  The solution will inevitably involve conversion from a reference format, and I can see three possible architectures here:

  1. Develop in HTML; convert this to PDF when needed.
  2. Develop in PDF; convert this to HTML when needed.
  3. Develop in format X; convert this by turns to HTML and PDF when needed.

Let’s dismiss architecture no. 2 straight away; no sane person develops in PDF.  Architecture no. 3 has some successful implementations (I’m thinking of reST here), but it has at least one problem: it makes my second requirement, bringing the brand image to both formats, difficult.  Why?  Because the “brand image” will be specified somewhere external to the content itself, in a form specific to format X, and this will then have to be transformed into CSS on the HTML side, and into whatever else is needed on the way to PDF (e.g. through LaTeX, of which I am ignorant).  This is presumably possible, but hard.  The only easy (and sensible) way architecture no. 3 can get around this is to go by the [Format X -> HTML -> PDF] route, but in that case we’re back to implementing architecture no. 1.

So, my iron logic dictates that the way of distributing content as both HTML and branding-aware PDF is to use an HTML -> PDF converter.


Now, searching for such a converter will bring you examples of software making valiant attempts to implement all of HTML and CSS, including the CSS2 and CSS3 additions for paged media.  I know of three converters of this type:

  • Pisa, or XHTML2PDF.  Written in Python, under GPL or commercial license.
  • dompdf, “a (mostly) CSS 2.1-compliant HTML to PDF converter.”  Written in PHP, licensed under LGPL.
  • Prince XML.  Closed source and bloody expensive, though a free watermarking version is available.

I’m going to dismiss Pisa: my (quickly aborted) experience with it was awful (every block element was placed in a visible box, and loads of lines of CSS were misinterpreted).  As for dompdf, I’ve actually used it to good effect when writing in PHP.  I have a couple of problems with it, though.  I’m not a big fan of PHP, which isn’t well suited to my ideal usage of a converter on the command line.  And there are some things it couldn’t handle so well: in particular pagination (with tables), the crucial ability of anything converting to a page-oriented format.

The third, Prince XML, is pretty good at what it does.  However, the free version I’ve tried out places its logo on the first page.  I don’t want this blemish on my lovingly crafted documents; nor do I want to waste my precious printer ink.  I considered two ways of hacking away the watermark.  The first was inserting an extra first page via CSS and then slicing that page off the PDF, but this fudges things like page numbering.  The second was programmatically removing the watermark, but this came to nil too (I couldn’t figure out how to do it).  In any case, I don’t feel comfortable violating the license like that, and it’s an ugly solution.  Also, did I mention that Prince XML is bloody expensive?


It took me quite a while to realise what is now an obvious point: all these programs are reinventing the wheel.  They are literally implementing their own browser and its ability to print to PDF.  Most of you can do all that right now: File > Print > Print to File, depending on your browser.  The happy realisation is that the bulk of the work needed for our task is right here, hidden away in our open-source browsers.

But let me re-cap exactly what I want to be able to do:


eegg@pc:~ ls
foo.htm
eegg@pc:~ cat foo.htm | html2pdf > foo.pdf
eegg@pc:~ ls
foo.htm foo.pdf

This CLI program, html2pdf, is not so difficult to create, considering that it just has to harness the power of a browser engine.  Gecko, the Firefox engine, uses the cairo graphics library, which can produce PDFs.  It seems logical, therefore, that it could easily be harnessed (here’s one long request for that).  One attempt has been made at a plugin that allows you to order a PDF copy of the page when launching Firefox: cmdlnprint.  It works, with some fairly big shortcomings:

  • You have to have an installation of Firefox.
  • Despite being ordered from the command line, it launches a window before doing its thing.  When I first used this for a batch job, I had hundreds of windows open and my computer ground to a halt.
  • It’ll interact with your ordinary Firefox profile — if you set your paper size to A5 for a print job in your lunch hour, your later batch jobs will use the same setting.  Yes, you can create a separate profile for the conversion job, but my experience of this hasn’t been good.
  • Firefox doesn’t seem to have very good support for CSS when it comes to printing.  It doesn’t seem to play well with paper sizes, print margins, and other things.

So, what engines will work well?  Let’s have a look at some cross-browser tests for HTML5 and CSS3.  Flying ahead is Safari, the Mac browser.  Before you lament, “I’m not on a Mac!”, note that these results are those of WebKit, the engine behind Safari, and WebKit is open source and free.  So where is WebKit outside of Safari?  WebKit requires a widget toolkit in order to run, and the two toolkits in my world are Qt and GTK.  Both have projects for integrating WebKit: QtWebKit and WebKitGTK respectively.


Herein lies the answer.  I stumbled across a project which seemingly has next to no publicity.  Without further ado, it’s wkhtmltopdf.  That’s WebKit HTML to PDF.  What’s more, if you’re on a Debian system, it’s in the repository.

How good is it?  I tested it against the HTML+CSS in an article at A List Apart about using Prince XML.  Compared to the output that Prince XML created from the same source, it’s fairly impressive.
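
To close the loop on the html2pdf command I wished for above, here is a minimal sketch of it in Python, as a thin wrapper around wkhtmltopdf.  It assumes the wkhtmltopdf binary is on your PATH and relies on its convention of accepting - for standard input and standard output.  The names build_command and html_to_pdf are my own, purely for illustration, and I haven't battle-tested this on large batch jobs.

```python
#!/usr/bin/env python3
"""html2pdf: read HTML on stdin, write PDF on stdout (illustrative sketch)."""
import shutil
import subprocess
import sys


def build_command(binary="wkhtmltopdf", extra_args=()):
    """Return the argument list for a stdin-to-stdout conversion.

    wkhtmltopdf accepts "-" as the input (stdin) and as the output (stdout).
    --quiet suppresses its progress messages.
    """
    return [binary, "--quiet", *extra_args, "-", "-"]


def html_to_pdf(html, binary="wkhtmltopdf", extra_args=()):
    """Pipe HTML bytes through wkhtmltopdf and return the PDF bytes."""
    result = subprocess.run(
        build_command(binary, extra_args),
        input=html,
        stdout=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


def main():
    # Assumption: the converter is installed and reachable via PATH.
    if shutil.which("wkhtmltopdf") is None:
        sys.exit("html2pdf: wkhtmltopdf not found on PATH")
    sys.stdout.buffer.write(html_to_pdf(sys.stdin.buffer.read()))


if __name__ == "__main__":
    main()
```

Saved as an executable file named html2pdf, this should behave like the session above: cat foo.htm | html2pdf > foo.pdf.  Extra options such as a paper size can be passed through extra_args, which keeps print settings out of any browser profile.  Bear in mind that check=True treats every non-zero exit as a failure, which may be stricter than you want if the converter complains about a missing image but still produces a PDF.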


13 responses to “The long road from HTML to PDF”

  1. Brian

    Thanks for this. I’ve spent hours looking for some way to batch process html files. My bash script of html2ps then ps2pdf didn’t preserve any of the CSS, and even when I rewrote the html with clunky old table attributes, html2ps still insisted on making full-sized images rather than resizing them per the width attribute.

    How odd that Mozilla doesn’t have this as part of their command line options!

    Anyway, installed wkhtmltopdf and it works fantastically. Thanks for the info!

  2. a

    can you post your results? (i.e. the two different PDF’s generated by prince and webkit)

  3. Thanks very much for pointing me in the right direction :-). Users of my open source statistics and reporting package (http://www.sofastatistics.com) have been saying for a while they need PDF support. Here are some additional pieces of relevant information:

    1) wkhtmltopdf is in the debian/ubuntu repositories so it is very easy to install in that environment.
    2) wkhtmltopdf is easy to use:
    wkhtmltopdf http://www.sofastatistics.com Desktop/sofa_snapshot.pdf
    3) The resulting PDFs keep scalable images and text scalable and the text is orderly and sortable – even in things like svg image tooltips made with the Dojo Javascript toolkit
    4) Page break issues need to be addressed but probably can: http://www.adras.com/Converting-HTML-to-PDF-with-Webkit.t21829-49.html

  4. Thanks!

    I could have used this 5 years ago! I built a queue system w/ slave windows virtual machines w/ firefox & flash & print to pdf. We’re going to be using this for everything that ever requires PDF on our production projects from now on. Thanks!!!

  5. Simon

    Nearly every search on google is finally pointing to wkhtmltopdf (if looking for freeware). It does its job pretty well. Well, unless I stumbled across a blocker. The font kerning gets totally screwed up in my PDFs (and as there exists a bug ticket, I’m not the only one), regardless of the font and type (embedded, ttf, otf, …).

  6. Francisco Vieira

    Thanks! I was about to have to implement my reports in JasperReports, since PrinceXML was not free, when I found this post referring to wkhtmltopdf! You just saved me two weeks of work!

  7. Hi!

    I discovered wkhtmltopdf today and it works very well but, can you copy text from the resultant pdf?

    Manuel Viera.

  8. Dan

    Yeah wkhtmltopdf definitely rocks. We’re using it as a basis for our website too and it can do a lot of stuff.

    A little side note though: if you really want all the power that PDFs can harness – you need some custom rework and possibly extra programs like pdftk….

    If you want to see a live working example: http://www.htm2pdf.co.uk

  9. I’m trying desperately to move from another expensive, proprietary solution to wkhtmltopdf, but headers/footers are kicking my butt. It needs CSS3 paged-media support and we’d be all set. So far I can’t figure out how to get it to work, if it is supported at all. Soon I’ll start digging into the source (I’ve already begun contributing). Anyone else looking for this support?
