12 April 2009

rev=canonical considered harmful (complete with sensible solution)

Sites like http://tinyurl.com/ provide a very simple service: turning unwieldy but information-rich URLs like http://samj.net/2009/04/open-letter-to-community-regarding-open.html into something more manageable like http://tinyurl.com/ceze29. This was traditionally useful for emails, since some clients mangle long URLs, but it also makes sense for URLs in documents, on TV, radio, etc. (basically anywhere a human has to enter one manually). Shorteners are a dime a dozen now - there are over 90 of them listed here alone... and I must confess to having created one at http://tvurl.com/ a few years back (the idea being that you could buy a TV-friendly URL). Not a bad idea, but there were other more important things to do at the time and I was never going to be able to buy my first island from the proceeds. Unfortunately, though, there are many problems with adding yet another layer of indirection, and the repercussions could be quite serious (bearing in mind that even the more trustworthy sites tend to come and go).

So a while back I whipped up a thing called "springboard" for Google Apps/AppEngine (having got bored with maintaining text files for use with Apache's mod_rewrite) which allowed users to create redirect URLs like http://go.example.com/promo (and which was apparently a good idea because now Google have their own version called short links). This is the way forward - you can tell at a glance who's behind the link from the domain and you even get an idea of what you're clicking through to from the path (provided you're not being told fibs). When you click on this link you get flicked over to the real (long) URL with an HTTP redirect, probably a 301, which means "Moved Permanently", so the browsers know what's going on too. If your domain goes down then chances are the target will be out of action too (much the same story as with third-party DNS), so there's a lot less risk. It's all good news, and if you're using a CMS like Drupal then it could be completely automated and transparent - you won't even know it's there, and clients looking for a short URL won't have to go and ask a third party for one.
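To make the redirect idea concrete, here is a minimal sketch of a same-domain redirector as a Python WSGI app. This is not the springboard code; the REDIRECTS table, the target URL in it and the name app are all made up for illustration.

```python
# Hypothetical lookup table: short path on your own domain -> real (long) URL.
REDIRECTS = {
    "/promo": "http://example.com/2009/04/some-long-promotional-page.html",
}


def app(environ, start_response):
    """Answer known short paths with a 301 and everything else with a 404."""
    target = REDIRECTS.get(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"No such short link\n"]
    # 301 tells browsers and crawlers that the move is permanent.
    start_response("301 Moved Permanently", [("Location", target)])
    return [b""]
```

For a quick local try-out this can be served with the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever().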

So the problem is that nowadays you've got every man and his dog wanting to feed your nice clean (but long) URLs through the mincer in order to post them on Twitter. Aside from being a security nightmare (the resulting URLs are completely opaque, though now clients like Nambu are taking to resolving them back again!?!), it breaks all sorts of things from analytics to news sites like Digg. Furthermore, there are much better ways to achieve this. If you have to do a round trip to shorten the URL anyway, why not ask the site for a shorter version of its canonical URL (that being the primary or 'ideal' URL for the content - usually quite long and optimised for SEO)? In the case of Drupal at least, every node has an ID, so you can immediately boil URLs down to http://example.com/node/123, http://example.com/123 or even use something like base32 to get even shorter URLs like http://example.com/3R.
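As a rough sketch of the base32 step: assuming the digits 0-9 followed by A-V (the same convention Python's int() uses for base 32), node 123 comes out as 3R. The function names here are hypothetical and nothing in this snippet is Drupal code.

```python
# Assumed alphabet: 0-9 then A-V, which matches int(token, 32) in Python.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def encode_node_id(node_id: int) -> str:
    """Turn a non-negative node ID into a base32 token for the short URL."""
    if node_id < 0:
        raise ValueError("node IDs must be non-negative")
    if node_id == 0:
        return "0"
    out = []
    while node_id:
        node_id, remainder = divmod(node_id, 32)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))


def decode_short_token(token: str) -> int:
    """Map a token from the short URL back to the node ID."""
    return int(token, 32)


print(encode_node_id(123))       # 3R
print(decode_short_token("3R"))  # 123
```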

So how do we express this for the clients? The simplest way is to embed LINK tags into the HEAD section of the HTML and specify a sensible relation ("rel"). Normally these are used to specify alternative versions of the content, icons, etc., but there's nothing to stop us saying that for any given URL(s) the "short" URL is e.g. http://example.com/3R. That's right, rel="short", not rel="alternate shorter" or other such rubbish ("alternate" refers to alternate *content*, usually in a different MIME type, not just an alternate URL - here the content is likely to be exactly the same). It can be performance optimised somewhat too by setting a header (e.g. X-Rel-Short) so that users (e.g. Twitter clients) can resolve a long URL to the preferred short URL via an HTTP HEAD request, without having to retrieve and parse the HTML.

Another even less sensible alternative being peddled by various individuals (and being discussed here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here and of course here) is [ab]using the rightly deprecated and confusing rev attribute à la rev="canonical". Basically this is saying "I am the authoritative/canonical URL and this other URL happens to point here too", without saying anything whatsoever about the URL itself actually being short. There could be an infinite number of such inbound URLs and this only ever works for the one canonical URL itself. Essentially this idea is stillborn and I sincerely hope that when people come back to work next week it will be promptly put out of its misery.

So in summary someone's got carried away and started writing code (RevCanonical) without first considering all the implications. Hopefully they will soon realise this isn't such a great idea after all and instead get behind the proposal for rel="short" at the WHATWG. Then we can all just add links like this to our pages:

<link href="http://example.com/promo" rel="short">
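On the client side, picking that link out of a page could look something like the following sketch, which uses Python's standard html.parser. The class and function names are made up, and rel="short" is still only a proposal, so treat this as an illustration rather than a reference implementation.

```python
from html.parser import HTMLParser


class ShortLinkFinder(HTMLParser):
    """Remember the href of the first <link> whose rel list contains `rel`."""

    def __init__(self, rel="short"):
        super().__init__()
        self.rel = rel
        self.short_url = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.short_url is not None:
            return
        attributes = dict(attrs)
        # rel is a space-separated list of relation names.
        rels = (attributes.get("rel") or "").lower().split()
        if self.rel in rels and attributes.get("href"):
            self.short_url = attributes["href"]


def find_short_url(html, rel="short"):
    """Return the advertised short URL, or None if the page has none."""
    finder = ShortLinkFinder(rel)
    finder.feed(html)
    return finder.short_url


page = '<html><head><link href="http://example.com/promo" rel="short"></head></html>'
print(find_short_url(page))  # http://example.com/promo
```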

Incidentally I say "short" and not "shorter" because the short URL may not in fact be the shortest URL for a given resource - "http://example.com/3R" could well also map back to the same page but the URL is meaningless. And I leave out "alternate" because it's not alternate content, rather just an alternate URL - a subtle but significant difference.

Let's hope sanity prevails...

Update: The HTTP Link: header is a much more sensible solution to the HTTP header optimisation:

Link: <http://example.com/promo>; rel="short"
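A client could then resolve a long URL with a single HEAD request. The sketch below uses only the Python standard library; the function names are hypothetical, and the parser is deliberately naive (it assumes no commas or semicolons inside quoted parameter values).

```python
import re
from urllib.request import Request, urlopen

# Matches one <target> followed by its ;-separated parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')


def find_short_link(link_header, rel="short"):
    """Return the first target in a Link header value whose rel includes `rel`."""
    for match in LINK_RE.finditer(link_header):
        target, params = match.group(1), match.group(2)
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.lower() == "rel" and rel in value.strip('"').lower().split():
                return target
    return None


def fetch_short_link(long_url, rel="short"):
    """HEAD the long URL and look for a short link, without fetching the body."""
    request = Request(long_url, method="HEAD")
    with urlopen(request) as response:
        for header in response.headers.get_all("Link") or []:
            found = find_short_link(header, rel)
            if found:
                return found
    return None


print(find_short_link('<http://example.com/promo>; rel="short"'))
# http://example.com/promo
```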

5 comments:

  1. No offense, but belittling rev="canonical" proponents and running code are probably not effective ways to sway people or convince them to work with you. I'd suggest you spend some more time distilling the argument against without insulting people or their work.

    Or better yet, forget the syntax and spend time exploring the implications of the actual concept regardless of expression in HTML.

  2. Ugh, and of course blogger gets my OpenID wrong.

  3. The concept is sound, but the implementation is potentially extremely dangerous - a single character slip (rev vs rel) could knock an entire site off its perch. I think the thing that really got me was the "another 30 minutes or less" production gloat - it took me less than 30 minutes to grok the implications. While I appreciate that we live in times of accelerating change some amount of application of the precautionary principle couldn't hurt.

    Another thing I found particularly disconcerting about the completely unjustified perceived urgency of the situation was production deployments to sites like ars.technica, apparently by individuals who are quicker to engage their fingers than their noggins, without adequate change control and on a holiday weekend no less. Do we know that Google aren't going to blow up and dump the site at the next google dance? Yahoo? No, of course we don't but we could certainly have afforded to wait and see.

    Sam

  4. I've temporarily put aside the rev=canonical on my site ( http://free.naplesplus.us ) and am sticking with the rel="shorturl shortlink ...etc etc" basing it upon php.net's idea but without the rev=canonical.

    I hear the for and against but until I get a better say-so from bigger authorities, I'm going to wait and see.

    Kenneth Udut

  5. Hey thanks for the heads up re: php.net - as at right now they have rev="canonical" rel="self alternate shorter shorturl shortlink".

    So we already know why rev="canonical" is a clusterf--k but what about the rest of them?

    rel="self" is defined by Atom (RFC4287) as "the preferred URI for retrieving Atom Feed Documents representing this Atom feed". Clearly that is far more likely to be the canonical URL rather than some redirector to it.

    rel="alternate" signifies "an alternate version of the resource" (e.g. PDF, plain text), NOT the same thing.

    rel="shorter" implies that the resource itself is shorter (e.g. an abstract or short video) as relations refer to the resource, not the URLs.

    And then there's the only half sensible alternative, rel="shorturl", but that has many permutations (/short[_- ]ur[li]/) and actually started life as short_url.

    rel="shortlink" is obvious, self documenting and virtually impossible to get wrong.

    Sam
