The good old 1955s

We divide our time up into regular blocks. Seconds, minutes, hours, days, weeks, months, years, decades, centuries, millennia.

When does one block end and the next start?  With most of our divisions of time, the decision is arbitrary.  January 1 is arbitrary.  The start of a new decade, or a new century, is decided by the year we have arbitrarily decided to call Year 1 AD.

So much might be obvious.  The decade of the 1980s is no more valid than the decade 1984–1993.  The 20th century is no more real than the century running 1873–1972.

But we attach things to our arbitrary periods of time.  The 1940s was a decade of war; the 1970s was the decade of disco.  The fifteenth century was the last in the era of knights and maidens; the sixteenth was the first in the era of art and science.

All this raises an interesting question.  What would the decade 1975–84 look like?  Would this decade capture the disco of the 70s, or the hair of the 80s?  What would the century 1850–1950 look like?  Would this be the century of world war — in which case we have lived not 12, but 62 years since that violent century?

Let’s call that decade the 75s, and that century the 19.5th century.  Write in and tell me what these new unexplored decades would look like.


Is a crime an occupation?

There are kinds of activity for which we not only have names for the activity, but also for the doer of the activity.  Not only sculpting but sculptor; not only drinking but alcoholic; not only psychotherapy but psychotherapist.  Clearly, when we refer to a person in this way, the implication is that the activity is part of that person’s identity and that they do it frequently and regularly.

As it happens, we also do this for crimes.  We not only have names for crimes, but for “criminals.”  We not only have theft, we have thief.  We don’t only have murder, we have murderer.  Not only rape but rapist.  And notice that I placed “criminals” in quotation marks: we not only have crime, we have criminal.  As before, the implication is that the person commits crime frequently and regularly and that crime is an inseparable part of that person.

Is this correct?  For what proportion of those that we call “thieves” is thievery a part of their identity?  How many of those that we refer to as “murderers” commit murder often and regularly?  An extreme minority.  Referring to a person who committed murder as a “murderer” is as correct as referring to me as an “astronomer” because I once looked through a telescope.

Why then do we refer to people who have broken the law in these terms?  I believe the major reason is a desire to dehumanize in order not to understand: she did not commit rape because of a complex of causation leading to the event, or because of the situation in which she did it — rather, the act of rape revealed her true identity as a rapist.  The causal explanation of guilt, which is difficult to understand, is substituted by the guilt itself, which is not an explanation at all and so is easy to understand.

Domain boycotting


I describe a potential web application for coordinating boycotts of corporations sponsoring objectionable US legislation.  Avoiding domains owned by these corporations, which are often large media conglomerates, results in lost advertising revenue for them.  The application is a browser plugin that warns the user when she tries to visit a domain owned by one of these corporations, and posts to Twitter when she declines to visit it.


In my last post I described potential ways in which web apps could enable successful boycotts.  To re-cap, I described a system which

  • “crowdsources” information of objectionable things (such as SOPA or animal cruelty) and the corporations that support them;
  • builds a base of users that can register as boycotting supporters of, for example, SOPA;
  • provides software to users to aid their purchasing decisions using their boycotts as a guide;
  • provides corporations with estimates of their losses.

There was some discussion on Reddit about it.

Existing sites

Reddit users helped me find some incipient sites that are trying to organize boycotts.  Among these are:

Apart from BuyorBoycott, which is a single-page static site at the moment, these sites are discussion boards, with some ability to join boycotts, and some ability to post to Twitter.

Changes since last post

In light of these, I’ve been trying to identify an MVP (Minimum Viable Product) that satisfies my USP (Unique Selling Point).  /marketingjargon

What would this minimal product look like?  Contemplating this, my initial plans have changed a bit:

  • “Objectionable things” is too general to be of use; we currently should concentrate on a specific subcategory of this.  For me, this subcategory is potential legislation; specifically, US bills.
  • Crowdsourcing is unnecessary for this category: lists of corporate sponsors of bills are at least semi-official.  If these can’t be automatically gathered, we could use vetted volunteers.
  • We don’t have to track purchasing decisions, because I’ve thought of another thing: people can boycott domains.  SOPA was supported by, among many others, News Corporation and Viacom; these corporations profit from advertising and they own many, many websites.  Coordinated efforts not to use them are a form of boycott.
  • We create a browser plugin that warns the user when they try to visit a domain owned by a corporation they are boycotting.  This should be simple.
  • We don’t have to build a real user base, or social communication tools, or the tools to provide corporations with estimated losses, because there’s a single tool out there that can do all of this: Twitter.  Users should log in with their Twitter accounts via OAuth.  With each domain visit that the user declines, we send a tweet saying so.  This has three effects:
    • we recruit new users
    • we create bad publicity for the corporation
    • we provide the corporation with real-time information on how their endorsement is affecting their business.


What work is required to implement this?

A knowledge database

A knowledge-database schema containing:

  • All US bills.
  • All corporations, and their parent corporations.
  • Links between bills and supporting corporations.
  • Links between domains and corporations.
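
A minimal sketch of such a schema, using SQLite from Python (all table and column names here are my own, illustrative rather than a settled design):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bill (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL                       -- e.g. 'SOPA'
);
CREATE TABLE corporation (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    parent  INTEGER REFERENCES corporation(id)  -- NULL for top-level parents
);
CREATE TABLE endorsement (                      -- bills <-> supporting corporations
    bill        INTEGER NOT NULL REFERENCES bill(id),
    corporation INTEGER NOT NULL REFERENCES corporation(id)
);
CREATE TABLE domain (                           -- domains <-> owning corporations
    name        TEXT PRIMARY KEY,               -- e.g. 'example.com'
    corporation INTEGER NOT NULL REFERENCES corporation(id)
);
""")
```

The self-referencing `parent` column is what lets us walk from a subsidiary up to the conglomerate that actually endorsed the bill.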

Data sources

Population of those tables automatically or semi-automatically:

An API to the database

A web service allowing queries of the form: given opposed bills Bs and domain D, is D part of a boycott of a bill in Bs?  The implementation is simple: this asks whether there exist corporations C and P and a bill B such that:

  • bill B ∈ bills Bs
  • corporation P endorses bill B
  • corporation C is a subsidiary of corporation P (or equal to corporation P)
  • corporation C owns domain D
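
A sketch of that check in Python, over plain in-memory structures standing in for the knowledge database (the data structures and names here are illustrative, not an API):

```python
def domain_boycotted(opposed_bills, domain, endorses, parent_of, owns):
    """Is `domain` part of a boycott of any bill in `opposed_bills`?

    endorses:  set of (corporation, bill) pairs
    parent_of: dict mapping a corporation to its parent corporation (or None)
    owns:      dict mapping a domain to the corporation that owns it
    Returns the matching (bill B, endorser P, owner C) triple, or None.
    """
    c = owns.get(domain)
    if c is None:
        return None
    # Walk the ownership chain upwards: C itself, then C's parent, and so on,
    # covering the "subsidiary of P (or equal to P)" condition.
    p = c
    while p is not None:
        for b in opposed_bills:
            if (p, b) in endorses:
                return (b, p, c)
        p = parent_of.get(p)
    return None
```

The only non-trivial part is the parent walk; everything else is set membership.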

A browser plugin

A Firefox or Chrome plugin.  The user provides it with a list of opposed bills, Bs, and optionally their Twitter details.  Every time the browser attempts to visit some domain D, the plugin queries the web service, is D part of boycotts of Bs?, then

  • if no: allow page load.
  • if yes (returning the specific bill B and corporations P and C): tell the user, D is part of your boycott of X supporters; cancel this page visit?
    • if no: allow page load.
    • if yes: cancel page load, and post a tweet: “I’m not visiting D because #ImBoycottingX”
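
The flow above, sketched abstractly in Python (the service query, the user prompt, and the tweet call are hypothetical stubs, not a real browser or Twitter API):

```python
def on_navigate(domain, opposed_bills, query_service, ask_user, tweet):
    """Decide whether to allow a page load, following the flow above.

    query_service(domain, bills) -> None, or a (bill, endorser, owner) hit
    ask_user(prompt) -> True to cancel the visit, False to proceed anyway
    tweet(text) -> posts to the user's Twitter account
    Returns True if the page load should go ahead.
    """
    hit = query_service(domain, opposed_bills)
    if hit is None:
        return True                      # not boycotted: allow page load
    bill, endorser, owner = hit
    if not ask_user(f"{domain} is part of your boycott of {bill} "
                    f"supporters; cancel this page visit?"):
        return True                      # user chose to visit anyway
    tweet(f"I'm not visiting {domain} because #ImBoycotting{bill}")
    return False                         # cancel the page load
```

In a real plugin the stubs would be the web-service call, a browser dialog, and an OAuth-authenticated Twitter post.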

A web front-end

A web app that Twitter users can log into using OAuth.  A front-end to the website with

  • plugin downloads
  • stats on the latest boycotting activity on Twitter (based on a search for tweets of the form “I’m not visiting D because #ImBoycottingX”)
  • a smart brand

I can certainly get started on this.  I’m looking for volunteers to help out.  Any takers?

Edit 1: links to OpenCongress and OpenSecrets.  Thanks, rasori!

Boycotting for the masses: a web solution


There is much that is wrong with the world and corporations are often to blame.  The most effective method of protest against the corporation is the boycott.  But boycotts are hard work requiring far too much time and effort.  Targeted boycotts are easier but leave the majority of guilty corporations unpunished.  The problem is that boycotts are uncoordinated and require instant access to information at every potential purchase.  I offer a potential web solution with four components: (1) a crowdsourced machine-readable database of objectionable things and their supporters; (2) a user area where users can register their participation in boycotts of those things; (3) user software to aid purchasing by providing instant information on whether a given product should be avoided; (4) a public website showing manufacturers’ estimated losses as a result of their actions.

Introduction: why boycott?

We live in a world of corrupt politicians and psychopathic corporations.  This much is entirely uncontroversial, and there’s no need for examples here.

Here’s the real problem: what the fuck can we do about it?

There are many tactics: letter-writing, indignation, using your vote, peaceful protest, violent protest, website banners, and so on.  They have their place but their effectiveness is limited for the simple reason that they don’t attack the enemy where it hurts.

We must understand that corruption is fundamentally driven by money and profit, not by ignorance, immorality, chutzpah, an illusion of public support, or anything else.  And so it must be here that we attack it.

There’s one simple tactic that does so: the boycott.

Boycotts are hard

In January 2012, Maddox posted an article about SOPA, the conclusion being that boycotting is the only way we’ll stop SOPA and everything like it that will follow.

What the article doesn’t address is the subsequent problem: boycotting is hard.  Why?

  • There are hundreds of organizations that officially support SOPA/PIPA/the next incarnation of the many bills designed to take away your freedom.
  • Of those, most will have many subsidiary companies or other connected organizations, meaning potentially thousands of brands one has to be aware of.
  • We buy things all the time and we don’t have the time for research into the political stance of our shampoo manufacturers (etc).
  • This is just one boycott!  The well-informed person may want to boycott many pieces of legislation, many corporations, many states, entire industries they disapprove of, and so on.

Here are some scenarios demonstrating the barrier:

  • You’re at your local shopping mall looking to buy X.  There are several shops that sell X.  But which of them support that new bill Y you hate so much?  No time to find out …
  • You’re at your local convenience store looking to buy X.  There are several different brands of X.  But which of them are associated with sweatshop work?  No time to find out …
  • You’re shopping for Xs in the vegetable aisle.  There are Xs imported from countries Y, Z, and Q.  But which of those have a horrible foreign policy?  No time to find out …

For me, these barriers are far too high: to research this effectively I would have to give up my job (which in turn would remove the income that gives me my power to boycott).

A non-solution: select, target, scapegoat

But we know that a boycott can be effective given a critical mass of support and media coverage.  This is what happened with GoDaddy.

This is the basis of Maddox’s proposal: choose a small number of companies and hit them hard.  This targeted, scapegoating approach is based on the understanding that most people can’t be bothered to do the research.

It is more effective than the “learn this list of companies” approach.  The problem is that all those other companies get off free!

A potential solution: the web, crowdsourcing, and purchasing adviser software

The good news is that the number of people who would like to take part in boycotts is far larger than those that have the time and determination to do so.  There is latent energy to be unleashed.  Unleashing this energy must be done by lowering the barrier to entry.  In short, if I’m going to boycott, then the research must be done for me and be instantly accessible.

My proposed solution has four key parts:

  1. A publicly accessible, well-researched, up-to-date, independent, crowdsourced database of wrong-doings and their supporters.  The world already has a half-solution: Wikipedia, the success of which relies on user-generated content.  However, it is unstructured content designed for the casual reader, and the information is not targeted to boycotters.
  2. A users’ site where the boycotter can declare the causes that they support.  In conjunction with the above database the site can then produce a comprehensive list of corporations/states/etc they should boycott.
  3. Software to help the user assess individual purchases.  For example:
    1. A browser plugin.  This has the following components:
      1. Access to the user account on the above site.  It therefore knows the manufacturers (etc) the user wishes to avoid.
      2. Access to databases of individual products (such as the Household Products Database).  It therefore knows, given a product, whether the user should avoid it.
      3. Access to the user’s browsing and the ability to inject warnings.  For instance, when shopping online, the plugin highlights products to boycott.
      4. The opt-in ability to supply the user’s boycotting history to the users’ site.
    2. A barcode-scanning smartphone app.  Similar components to the above plugin, with the ability to identify a product from its barcode (like existing price comparison apps, e.g. Scandit for iPhone or Barcode Scanner for Android).
  4. An online summary of boycott effectiveness.  If companies don’t know why their sales are falling, they won’t change their stance!  Using data volunteered by users, we can publish (user-anonymized) estimates of how many dollars a manufacturer has lost due to their support of such-and-such.

Here’s a quick feasibility study:

  • The required technology exists and is mature.  User-produced, user-audited content is everywhere.  Independent product databases exist.  Barcode-scanning is reliable.
  • People are comfortable with the technology.  Users already guide their purchases with price comparison and user-review websites/apps.
  • Initial costs are for the software; volunteers and funding should be available.  The open-source voluntary model is successful.  Necessary funding could also be found on, say, Kickstarter.
  • Ongoing costs are mainly for servers, and other sites get by.  Non-profits like Wikipedia survive on voluntary contributions and this should be no different.

So, what do you say?  LET’S BOYCOTT!

The Thatcher effect in typography

Near my home, on my dog-walking route, there’s a small business called ‘FDM.’  Their initials are emblazoned in 2000pt capitals on the side of the property.  I don’t know what it stands for, or what they do.  The reason I bring them up is one screamingly disrespectful disregard for typography, in what is an otherwise entirely sober Roman-esque sign.

I’m going to first show it to you upside down, in a fictional billboard advertisement sponsored by the lovely Jayma Mays:

Lovely — both of them sexy and sophisticated; both with subtle, clean curves that demand attention precisely due to their understatedness; both enticing you, by just giving a little away, to look further.


Yes, until the potential customer stops sitting in the driver’s seat upside-down, or reverts from their hand-walking on the pavement:

WTF?!  Or should I say, MTF?  The ‘M’ has suddenly broken in two and fallen in on itself.  Why was it considered a good idea to use an upside-down ‘W’ in place of the ‘M’ into which the artist had poured countless hours of labour in order to be completely unobtrusive?

Is it intended as ‘attention-grabbing’? It works; but not all publicity is good publicity — I’m going to have to find myself a new dog-walking route.

P.S.: I’m sorry, Jayma.  Every stroke of the GIMP brush was like a dagger in your baby-soft skin.  But it was in the name of Typography!

Page margins in principle and practice

Imagine a book with no page margins. The text runs right to the edge of the paper. You’ll have to crack the spine to gain access to characters in the gutter. To access the text at the bottom, you have to move your thumbs out of the way. If the book is a little old, the characters on the outside may be worn off entirely. No header or footer is present, so navigation is a chore. To make a note on something you find, you’ll have to write it between the lines. We haven’t even mentioned that the book looks horrible, or that it has forced the publisher through hoops to produce it in full bleed.

The margin, then, is an essential element of all paged media. It solves all the horrors above: spinal injuries are greatly reduced; the closed book, when dropped, has its content protected by a chunk of wood; you can hold it comfortably; page numbers and section titles guide you around; you have space for marginal comments; the composition is pleasing; and ‘printing on the edge’ is no longer an issue (you can’t do it with your home printer).

Competing rationales

So margins are a Good Thing. In implementing them, though, we’ll have to be more specific: how big should the four margins (top, bottom, inside, outside) be, given the size of the book?

This simple question doesn’t have a simple answer. The big reason for this is competing rationales; for each design consideration there is a different optimum:

Goal | Ideal margin appearance
Save the book’s spine | Give precedence to the inside margin, especially in fat books.
Blank space for holding the book | Precedence to the bottom and outside margins.
Wear does not affect book content | Precedence equally to all but the inside margin, which doesn’t wear.
Navigation is easy | Precedence to the top and bottom margins.
Ample space for reader’s notes | Precedence to the outside margin.
Pleasing composition | Printed-area vertices lie on page “ley lines”; geometrical ratios.
Don’t require print bleed | At least 5mm on all margins.

So different goals are pushing in different, sometimes opposite, directions. Some goals are independent of the page area; some are not. Some are independent of page ratio, others not. Some are dependent on book length, others not, and so on.

Canons of page construction

Let’s begin with the most complex of the design goals: pleasing composition. Wikipedia calls these considerations the canons of page construction. The geometrical means of constructing an ideal page are remarkably long-standing and agreed upon. Surprisingly, though, I couldn’t find implemented algorithms online, so over at github I’m hosting a small library I’ve written for this article.

The first principle is that some ratios are better than others. These ratios should be applied both to the page and to the printable area. The fewer distinct ratios in the composition, the better. They are:

  • 2 : 3
  • 1 : φ (the golden ratio)
  • 1 : √2 (the ratio governing A3, A4, A5 paper, etc.)

The second principle is that the rectangle defining the printable area should have vertices that lie on what I would call “ley lines”. If you have a two-page spread in front of you, these lines are those you can draw between five vertices: the four corners of the book plus the top of the gutter.

A third principle, not always applied, is that the print width should be the same as the page height.

A few different methods exist over at Wikipedia, aiming for the above goals. Most actually boil down to the same result: the Van de Graaf canon.  This is the most general algorithm, and the two other main methods obtain the same result when the page ratio is 2 : 3.  I’ll let you judge for yourself whether the results are pleasing. In order of decreasing page height, these are outputs from my above script:

The Van de Graaf Canon at 1:φ page ratio

The Van de Graaf Canon at 2:3 page ratio

The Van de Graaf Canon at 1: √2 page ratio

The Van de Graaf Canon at 1:1 page ratio

The Van de Graaf Canon at 3:2 page ratio

Let’s say you agree with me that the above are beautiful.  Note that, looking at the right-hand page:

  • the printed area is the same shape as the page area
  • the top left and bottom right vertices of the printed area lie on the diagonal of the right-hand page
  • the top right vertex of the print area lies on the diagonal of the two-page spread
  • the gutter margins together are the same width as an outer margin
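
These properties all follow from the “ninths” statement of the canon: the inner margin is one ninth of the page width, the top margin one ninth of the page height, and the outer and bottom margins twice those.  A sketch, assuming that statement (`van_de_graaf` is my own helper, not the github library mentioned above):

```python
def van_de_graaf(width, height):
    """Van de Graaf margins for one (right-hand) page, in page units.

    Divide the page into ninths: inner margin = width/9, top = height/9,
    outer = 2*width/9, bottom = 2*height/9.  The printed area is then
    two thirds of the page in each dimension, so it has the same shape
    as the page itself, and covers 4/9 of it.
    """
    return {
        "inner": width / 9,
        "outer": 2 * width / 9,
        "top": height / 9,
        "bottom": 2 * height / 9,
    }

# On a 2:3 page of 120 x 180 units the printed area is 80 x 120:
# the printed height equals the page width, and the two gutter
# margins together equal one outer margin.
m = van_de_graaf(120, 180)
printed_w = 120 - m["inner"] - m["outer"]
printed_h = 180 - m["top"] - m["bottom"]
```

Note that the print-height-equals-page-width property is special to the 2:3 ratio; the ninths construction itself works on any page shape.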

Meditation on the Van de Graaf

I would guess that your first thought after looking at these (other than that they’re attractive) is that they are so liberal in their use of space.  My second guess is that this thought derives from experience: have you ever seen a bottom margin as big as that on the 1:φ ratio?  Why not?  The answer lies firstly in the competing rationales above, and secondly in further rationales, all aiming to reduce margins to nothing, which I list here:

  • Spend less on paper. Paper margins and profit margins aren’t friends.
  • Save trees. Publishers and consumers alike are conscious of the environment.
  • Save space and weight. Less margin means paying for less shelf space at the bookshop.  Volume and weight could be halved, which reduces transportation costs for the publisher and consumer (you’re travelling with your Comprehensive Travel Guide to Asia; decide between buying the Van de Graaf edition, or buying a zero-margin copy letting you squeeze another pastel-coloured holiday novel in your suitcase).


I’ve got about as far as I can with theory.  At this point I decided to take some measurements of books on my shelf.  (The data from my twenty samples is in an OO spreadsheet in the above git repo.)  Here are some summary findings:

  • Books, and the printable area, are taller and thinner than Jan Tschichold’s favoured 2:3 ratio.  The average is 2:3.1 for page size, and 2:3.3 for printable area — hovering around 1:φ but not hitting it.
  • The printable area was invariably taller and thinner than the page area, in contrast to the constant ratio of the Van de Graaf.
  • Publishers can’t decide between bigger outside margins (space for comments, &c.) and bigger inside margins (saving cracked spines).  On average, the ratio of inside to outside was 1:1, but few if any hit that deliberately.

The Typography of Discworld

I found many page constructions that were unappealing, niggardly, and non-functional.  However, I did come across a few that were not.  One neat example is in the Corgi editions of Terry Pratchett’s Discworld books.  The construction is as follows.

Draw the diagonals of the full-page spread.  Next draw the ‘V’-shape as in the Van de Graaf, but upside-down.  Mark the verticals at the intersections (thus dividing the spread into three equal slices, as in the Van de Graaf).  Draw diagonals from the intersections to the bottom of the vertical on the opposite page.  The new intersections are the inside bottom of the printable areas.  Draw two more diagonals, from the top of the verticals to the outside bottom corners of the same page.  From the two known printable area vertices, draw horizontally; the intersection at the new diagonal is the outside bottom of the printable area.  Finally, draw vertically until you hit the ‘V’; this marks the top of the printable area.  Look at it:

Geometrically constructing the page of a Corgi Discworld book (my image)

Constructing the Van de Graaf canon on the same page spread

The Discworld canon certainly looks similar in geometric spirit, but the two are dissimilar in other ways.  The Discworld gives you more page area (66% compared to 44%).  It throws most of the outside margin away, disregarding the aesthetic principle that the gutter width should be the same as one outside margin.  Take a look at it on the same page ratios I used earlier:

The Discworld canon at 1:φ ratio

The Discworld canon at 2:3

Discworld canon at 1:√2

Discworld canon at 1:1

Discworld canon at 3:2

Word processors, and designing a single-page canon

How do today’s word processors implement page margins? The first thing to note is that, by default, all pages are symmetrical: there’s no such thing as left and right pages. Considering that its output will most likely be unbound A4 from a home printer, this makes sense.

Specifically, the following margins are set by default (My figures for Word are based on Google; I don’t have access to an installation myself):

Word processor | Top margin | Bottom margin | Inside margin | Outside margin
Microsoft Word | 1″ | 1″ | 1.25″ | 1.25″
OpenOffice Writer | 20mm | 20mm | 20mm | 20mm

How were these figures decided upon?  My guess is that foremost they’re fetishes for a particular measurement system: Word the imperial, Writer the metric.  Neither seems to be a good basis for a sensible default.  For example, when working with an A4 page, whose long side is an irrational length (210mm × √2), we would expect irrational figures for the margins, too.  Word’s decision to go for larger side margins than end margins is especially odd; my survey above put side margins at 60% of end margins.

First attempt at a single-page canon

The reason that word processors don’t have agreed sensible margins is that, seemingly, no canon has been designed for pages that are not part of a two-page spread.  So, why not use our principles from above to create one?  Let’s proceed:

  1. This time we work with only four starting vertices: the four corners of a rectangle.
  2. There are only two ‘ley lines’: the diagonals from one corner to its opposing corner.
  3. We can (and so will) place the printable area’s vertices on these ley lines.
  4. The fewer ratios, the better: let’s use the page ratio for the printable-area ratio.  (So far this is equivalent to scaling down the page rectangle about its centre.  We just need one principle to fix the scale…)
  5. The printable height is equal to the page width.

We end up with the following:

The naïve single-page canon

The naïve single-page canon at 2:3

Lovely!  We’ve just bettered the two biggest word processing packages!  Or have we …

Naïve single-page canon at 1:2. Only 25% of the page used.

Naïve single-page canon at 1:1. 100% of the page used!

Naïve single-page canon at 3:2. A full 225% of the page used!
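
Written out, the construction is a couple of lines, and the coverage figures in the captions above fall out as the square of the width-to-height ratio (a sketch; `naive_canon` is my own naming):

```python
def naive_canon(width, height):
    """Margins and page coverage of the naive single-page canon.

    The printable area is the page rectangle scaled down about its
    centre (same ratio, vertices on the diagonals), with the scale
    fixed by setting the printable height equal to the page width.
    """
    scale = width / height              # printable height = page width
    printable_w = scale * width
    printable_h = scale * height        # equals the page width
    side = (width - printable_w) / 2    # left/right margins (equal)
    end = (height - printable_h) / 2    # top/bottom margins (equal)
    coverage = (printable_w * printable_h) / (width * height)  # = scale**2
    return side, end, coverage

# On a 3:2 (landscape) page the margins come out negative and the
# "coverage" is 225%: the printable area pokes out beyond the page.
```

The failure mode is visible in the formula: coverage is (width/height)², so any page wider than it is tall explodes past 100%.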

A successful single-page canon: the Double-Circle

Working only on a 2:3-ratio sheet, I had disregarded the other possibilities.  The naïve canon above varies the printable area wildly with the page ratio, and that won’t do.  We need one that produces sensible results for all page shapes: tall, square, and fat.  I’ve developed one that does, and I call it the Double-Circle Canon.  Here it is:

Double-Circle canon at 2:3

And, as before, let’s see it in action:

The Double-Circle canon at 1:φ

The Double-Circle canon at 2:3

The Double-Circle canon at 1:√2

The Double-Circle canon at 1:1

The Double-Circle canon at 3:2

Redundant information in unordered lists: fundamental?

Let’s say you have a list of items, in some specific ordering: the list of your friends [james, tom, harry], say, in order of age.  The way I see it, there are two “types” of information here: the list items, and the list order.

Now, let’s say you want to make a list of specific “friendships”: [(eegg, james), (eegg, tom), (eegg, harry), (james, tom), (tom, harry)].  Now, there are several ‘orderings’ in this list that could be used to convey information: the order in which you list the friendships, and the order in which you list the two friends in each 2-tuple.

But what information can you convey using these orderings?  When specifying a friendship between two people, can you identify two “roles” played there?  Answer: no.

So you decide that you want to specify those friendships without putting any information in those orderings.  As you have many friends and space is at a premium, you also decide to compress the data, “squeezing out” the space wasted on the arbitrary order in which everything is specified.  Open question: how do you go about this?

Or to put the question a little more mathematically/computer-sciencey: what is the most space-efficient way of serializing an arbitrary unordered set of items?
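
For what it’s worth: an unordered list of n distinct items has n! equally valid orderings, so any serialization that fixes an arbitrary one carries about log₂(n!) bits that encode nothing.  One practical (if not information-theoretically optimal) answer is to canonicalize before serializing, so the order becomes a function of the content and a general-purpose compressor sees no arbitrary variation.  A sketch for the friendship example:

```python
def canonical_friendships(friendships):
    """Serialize a collection of unordered friendship pairs canonically.

    Sorting inside each pair, then sorting the pairs themselves, means
    every ordering of the same input produces byte-identical output:
    the arbitrary orderings no longer encode (or waste) anything.
    """
    pairs = sorted(tuple(sorted(p)) for p in friendships)
    return ";".join(",".join(pair) for pair in pairs)
```

This removes the redundancy from the serialized form; it doesn’t by itself recover the log₂(n!) bits, but it hands a compressor one representative per set rather than n! of them.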

“Normally, I hand craft my images using vim.”

The above is a quote from Sam Ruby’s blog.  This deliciously innocent pedantry made me choke on my coffee in laughter.

The long road from HTML to PDF

HTML and PDF are the two most common document formats on the web.  With good reason: HTML and friends give you portability on the screen, while PDF gives you portability on paper.  With a few exceptions, they’re two ways of displaying the same content, optimized for different media.  So many developers will want to present their end-users with both ways of viewing their content: are you going to read this on screen, or on paper?  What’s more, developers will want their site’s brand image carried over to the end media too.

How do we achieve this?  The solution will inevitably involve conversion from a reference format, and I can see three possible architectures here:

  1. Develop in HTML; convert this to PDF when needed.
  2. Develop in PDF; convert this to HTML when needed.
  3. Develop in format X; convert this by turns to HTML and PDF when needed.

Let’s dismiss architecture no. 2 straight away; no sane person develops in PDF.  Architecture no. 3 has some successful implementations (I’m thinking of reST here), but it has at least one problem: it makes it difficult to meet my second requirement of bringing the brand image to both formats.  Why?  Because the “brand image” will be specified somewhere external to the content itself, in a format specific to content format X, and this will then have to be transformed into CSS on the HTML side, and whatever else on the way to PDF (e.g. through LaTeX, of which I am ignorant).  This is presumably possible, but hard.  The only easy (and sensible) way architecture no. 3 can get around this is to go by the [Format X -> HTML -> PDF] route — but in this case we’re back to implementing architecture no. 1.

So, my iron logic dictates that the way of distributing content as both HTML and branding-aware PDF is to use an HTML -> PDF converter.

Now, searching for this will bring you examples of software making valiant attempts to implement all of HTML and CSS, including the CSS2 and 3 additions for paged media.  I know of three converters of this type:

  • Pisa, or XHTML2PDF.  Written in Python, under GPL or commercial license.
  • dompdf, “a (mostly) CSS 2.1-compliant HTML to PDF converter.”  Written in PHP, licensed under LGPL.
  • Prince XML.  Closed source and bloody expensive, though a free watermarking version is available.

I’m going to dismiss Pisa: my (quickly aborted) experience with it has been awful (every block element was placed in a visible box; many lines of CSS were misinterpreted).  dompdf I’ve actually used to good effect when writing in PHP.  A couple of problems with it: I’m not a big fan of PHP (not well-suited to my ideal usage of a converter on the command line), and some things it couldn’t handle so well: in particular pagination (with tables), the crucial ability of anything converting to a page-oriented format.

The third, Prince XML, is pretty good at what it does.  However, the free version I’ve tried out places its logo on the first page.  I don’t want this blemish on my lovingly-crafted documents; nor do I want to waste my precious printer ink.  I considered hacking away the watermark: first by inserting an extra first page via CSS and then slicing the first page off the PDF, but this fudges things like page numbering; second by programmatically removing the watermark, but this came to nil too (I couldn’t figure out how to do it).  In any case, I don’t feel comfortable violating the license like that, and it’s an ugly solution.  Also, did I mention that Prince XML is bloody expensive?

It took me quite a while to realise what is now an obvious point: all these programs are reinventing the wheel.  They are literally implementing their own browser and its ability to print to PDF.  Most of you can do all that right now: File > Print > Print to File, depending on your browser.  The happy realisation is that the bulk of the work needed for our task is right here, hidden away in our open-source browsers.

But let me re-cap exactly what I want to be able to do:

eegg@pc:~$ ls
foo.htm
eegg@pc:~$ cat foo.htm | html2pdf > foo.pdf
eegg@pc:~$ ls
foo.htm foo.pdf

This CLI program, html2pdf, is not so difficult to create, considering that it just has to harness the power of a browser engine.  Gecko, the Firefox engine, uses the cairo graphics library, which can produce PDFs.  It seems logical, therefore, that it could easily be harnessed (here’s one long request for that).  One attempt has been made at a plugin that allows you to order a PDF copy of the page when launching Firefox: cmdlnprint.  It works, with some fairly big shortcomings:

  • You have to have an installation of Firefox.
  • Despite being ordered from the command line, it launches a window before doing its thing.  When I first used this for a batch job, I had hundreds of windows open and my computer ground to a halt.
  • It’ll interact with your ordinary Firefox profile — if you set your paper size to A5 for a print job in your lunch hour, your later batch jobs will use the same setting.  Yes, you can create a separate profile for the conversion job, but my experience of this hasn’t been good.
  • Firefox doesn’t seem to have very good support for CSS when it comes to printing.  It doesn’t seem to play well with paper sizes, print margins, and other things.

So, what engines will work well?  Let’s have a look at some cross-browser tests for HTML5 and CSS3.  Flying ahead is Safari, the Mac browser.  Before you lament, “I’m not on a Mac!”: these results are really those of WebKit, the engine behind Safari, and WebKit is open source and free.  So where is WebKit outside of Safari?  WebKit requires a widget toolkit in order to run, and the two toolkits in my world are Qt and GTK.  Both have projects integrating WebKit: respectively, QtWebKit and WebKitGTK.

Herein lies the answer.  I stumbled across a project which seemingly has next to no publicity.  Without further ado, it’s wkhtmltopdf.  That’s WebKit HTML to PDF.  What’s more, if you’re on a Debian system, it’s in the repository.

How good is it?  I tested it against the HTML+CSS in an article at A List Apart talking about the use of Prince XML.  Compared to the output that Prince XML created from the same source, it’s fairly impressive.