rel=shortlink: url shortening that really doesn’t hurt the internet

Inspired primarily by the fact that the guys behind the RevCanonical fiasco are still stubbornly refusing to admit they got it wrong (the whole while arrogantly brushing off increasingly direct protests from the standards community) I’ve whipped up a Google App Engine application which reasonably elegantly implements rel=shortlink: url shortening that really doesn’t hurt the internet:

http://rel-shortlink.appspot.com

It works just like TinyURL and its ilk, accepting a URL and [having a crack at] shortening it. It checks both the response headers and (shortly) the HTML itself for rel=shortlink and if they’re not present then you have the option of falling back to a traditional service (the top half a dozen are preconfigured or you can specify your own via the API’s “fallback” parameter).

An interesting facet of this implementation is the warnings it gives if it encounters the similar-but-ambiguous short_url proposal and the fatal errors it throws up when it sniffs out the nothing-short-of-dangerous rev=canonical debacle. Apparently people (here’s looking at you Ars Technica and Dopplr) felt there was no harm in implementing these “protocols”. Now there most certainly is.
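To make the warning/fatal distinction concrete, here is a minimal sketch of how a client might grade a single link element. It is not the app's actual code: the function name `classify_link`, the return values and the exact three-way policy are assumptions for illustration; only the relation names come from the proposals discussed here.

```python
# Relation names are taken from the proposals discussed in this post; the
# grading policy and the function name are illustrative assumptions.
ACCEPTED = {"shortlink", "http://purl.org/net/shortlink"}
AMBIGUOUS = {"short_url"}


def classify_link(attrs):
    """Grade one <link> element's attributes as 'ok', 'warning', 'fatal' or 'ignore'."""
    # HTML link types are a space-separated, case-insensitive list of tokens.
    rel_tokens = set(attrs.get("rel", "").lower().split())
    rev_tokens = set(attrs.get("rev", "").lower().split())
    if "canonical" in rev_tokens:
        return "fatal"  # rev=canonical: refuse outright
    if rel_tokens & ACCEPTED:
        return "ok"
    if rel_tokens & AMBIGUOUS:
        return "warning"  # usable, but flag it
    return "ignore"
```

Checking rev=canonical first means a page that advertises both relations is still rejected, which seems to match the spirit of treating it as a fatal error.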

Here are the high-level details (from the page itself):

Who
A community service by Sam Johnston (@samj / s…@samj.net) of Australian Online Solutions, loosely based on a relatively good (albeit poorly executed) idea by some web developers purporting to “save the Internet” while actually hurting it.
What
A mechanism for webmasters to indicate the preferred short URL(s) for a given resource, thereby avoiding the need to consult a potentially insecure/unreliable third-party for same. Resulting URLs reveal useful information about the source (domain) and subject (path):
http://tinyurl.com/cgy9pu » http://purl.org/net/shortlink
Where
The shortlink Google Code project, the rel-shortlink Google App Engine application, the #shortlink Twitter hashtag and coming soon to a client or site near you.
When
Starting April 2009, pending ongoing discussion in the Internet standards community (in the meantime you can also use http://purl.org/net/shortlink in place of shortlink).
Why
Short URLs are useful both for space constrained channels (such as SMS and Twitter) and also for anywhere URLs need to be manually entered (e.g. when they are printed or spoken). Third-party shorteners can cause many problems, including link rot, performance problems, outages and privacy & security issues.
How
By way of <link rel="shortlink"> HTML elements and/or Link: <URL>; rel=shortlink HTTP headers.
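
For the HTML half of the “How”, a client could pull the shortlink out of a page along these lines. This is only a sketch using Python's standard html.parser; the names `ShortlinkParser` and `find_shortlinks` are made up for illustration and it does not pretend to be how the App Engine application does it.

```python
from html.parser import HTMLParser


class ShortlinkParser(HTMLParser):
    """Collect href values of <link> elements whose rel includes 'shortlink'."""

    def __init__(self):
        super().__init__()
        self.shortlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr_map = dict(attrs)
        # rel is a space-separated list, so look for the token rather than
        # comparing the whole attribute value.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "shortlink" in rel_tokens and href:
            self.shortlinks.append(href)


def find_shortlinks(html):
    """Return every shortlink href found in the given HTML text, in document order."""
    parser = ShortlinkParser()
    parser.feed(html)
    return parser.shortlinks
```

Called on a page containing `<link rel="shortlink" href="http://example.com/promo">`, `find_shortlinks` returns a one-element list with that URL.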

So have at it and let me know what you think. The source code is available under the AGPL license for those who are curious as to how it works.

Introducing rel=”shortlink”: a better alternative to URL shorteners

Yesterday I wrote rather critically about a surprisingly successful drive to implement a deprecated “rev” relationship. This developed virtually overnight in response to the growing “threat” (in terms of linkrot, security, etc.) of URL shorteners including tinyurl.com, bit.ly and their ilk.

The idea is simple: allow the sites to specify short URLs in the document/feed itself, either automatically ([compressed] unique identifier, timestamp, “initials” of the title, etc.) or manually (using a human-friendly slug). That way, when people need to reference the URL in a space constrained environment (e.g. microblogging like Twitter) or anywhere they need to be manually entered (e.g. printed or spoken) they can do so in a fashion that will continue to work so long as the target does and which reveals information about the content (such as its owner and a concise handle).

Examples of such short URLs include:

The idea is sound but the proposed implementation is less so. There is (or at least was) provision for “rev”erse link references but these have been deprecated in HTML 5. There is also a way of hinting the canonical URI by specifying a rel=”canonical” link. This makes a lot of sense because often the same document can be referred to by an infinite number of URIs (e.g. in search results, with sort orders, aliases, categories, etc.). Combine the two and you’ve got a way of saying “I am the canonical URI and this other URI happens to point at me too”, only it can only ever (safely) work for the canonical URL itself and it doesn’t make sense to list one arbitrary URL when there could be an infinite number.

Another suggestion was to use rel=”alternate shorter” but the problem here is that the content should be identical (except for superficial formatting changes such as highlighting and sort order), while “alternate” means “an alternate version of the resource” itself – e.g. a PDF version. Clients that understand “alternate” versions should not list the short URL as the content itself is (usually) the same.

Ben Ramsey got closest to the mark with A rev=”canonical” rebuttal but missed the “alternate” problem above, nonetheless suggesting a new rel=”shorter” relation. Problem there is the “short” URI is not guaranteed to be “shortest” or indeed even “shorter” – it still makes sense, for example, to specify a “short” URI of http://example.com/promo to a user viewing http://example.com/123 because the longer “short” URI conveys information about the content in addition to its host.

Accordingly I have updated WHATWG RelExtensions and will shortly submit the following to the IANA IESG for addition to the Atom Link Relations registry:

Value:
shortlink (http://purl.org/net/shortlink)

Description:
A short URI that refers to the same document.

Expected Display Characteristics:
This relation may be used as a concise reference to the document. It will
typically be shorter than other URIs (including the canonical URI) and may
rely on a [compressed] unique identifier or a human readable slug. It is
useful for space constrained environments such as email and microblogs as
well as for URIs that need to be manually entered (e.g. printed, spoken).
The referenced document may differ superficially from the original (e.g.
sort order, highlighting).

Security Considerations:
Automated agents should take care when this relation crosses administrative domains (e.g., the URI has a different authority than the current document). Such agents should also avoid circular references by resolving only once.
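
One way an automated agent might honour both of those considerations is sketched below. The registration text only says to “take care” across administrative domains; rejecting cross-domain shortlinks unless the caller opts in is my assumption here, as are the names `same_authority`, `resolve_shortlink` and the `lookup` callable (which stands in for whatever fetches the advertised shortlink for a URL).

```python
from urllib.parse import urlsplit


def same_authority(document_url, short_url):
    """True when both URLs have the same authority (host and optional port)."""
    return urlsplit(document_url).netloc.lower() == urlsplit(short_url).netloc.lower()


def resolve_shortlink(document_url, lookup, allow_cross_domain=False):
    """Look up the advertised shortlink exactly once; never follow a chain.

    `lookup` is a hypothetical callable returning the shortlink a document
    advertises, or None if it advertises nothing.
    """
    candidate = lookup(document_url)  # single lookup, so circular references cannot loop
    if candidate is None:
        return None
    if not allow_cross_domain and not same_authority(document_url, candidate):
        return None  # different administrative domain: the caller must opt in
    return candidate
```

Because the function never calls `lookup` on its own result, two documents that name each other as shortlinks simply resolve to one another and stop.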

Note that in the interim “http://purl.org/net/shortlink” can be used. Bearing in mind that you should be liberal in what you accept, and conservative in what you send, servers should use the interim identifier for now and clients should accept both. Nobody should be accepting or sending rev=”canonical” or rel=”alternate shorter” given the problems detailed above.

Update: It seems there are still a few sensible people out there, like Robert Spychala with his Short URL Auto-Discovery document. Unfortunately he proposes a term with an underscore (short_url) when it should be a space and causes the usual URI/URL confusion. Despite people like Bernhard Häussner claiming that “short_url is best, it’s the only one that does not sound like shortened content“, I don’t get this reading from a “short” link… seems pretty obvious to me and you can always still use relations like “abstract” for that purpose. In any case it’s a valid argument and one that’s easily resolved by using the term “shortcutlink” instead (updated accordingly above). Clients could fairly safely use any link relation containing the string “short”.

Update: You can follow the discussion on Twitter at #relshortcut, #relshort and #revcanonical.

Update: I forgot to mention again that the HTTP Link: header can be used to allow clients to find the shortlink without having to GET and parse the page (e.g. by doing a HEAD request):

Link: <http://example.com/promo>; rel="shortlink"
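
A rough sketch of the client side of that, using only the Python standard library: issue a HEAD request, then pick the shortlink target(s) out of any Link headers. The function names are hypothetical and the regular expressions are a deliberate simplification rather than a full Link header parser (for instance, they do not cope with a quoted parameter value that itself contains an angle bracket).

```python
import re
import urllib.request

# Simplified grammar: a <target> followed by everything up to the next "<".
LINK_RE = re.compile(r'<([^>]*)>([^<]*)')
# The rel parameter, quoted or unquoted, introduced by a semicolon.
REL_RE = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s";,]+))', re.IGNORECASE)


def shortlinks_from_link_header(value):
    """Return the targets in one Link header value whose rel includes 'shortlink'."""
    found = []
    for target, params in LINK_RE.findall(value):
        match = REL_RE.search(params)
        if not match:
            continue
        rel_tokens = (match.group(1) or match.group(2) or "").lower().split()
        if "shortlink" in rel_tokens:
            found.append(target)
    return found


def head_shortlinks(url, timeout=10):
    """HEAD the URL and return any shortlinks advertised in its Link headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        values = response.headers.get_all("Link") or []
    return [link for value in values for link in shortlinks_from_link_header(value)]
```

Keeping the parsing separate from the network call makes it easy to reuse the same function on headers obtained some other way.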

Update: Both Andy Mabbett and Stan Vassilev also independently suggested rel=shortcut, which leads me to believe that we’re on a winner. Stan adds that we’ve other things to consider in addition to the semantics and Google’s Matt Cutts points out why taking rather than giving canonical-ness (as in RevCanonical) is a notoriously bad idea.

Update: Thanks to the combination of Microsoft et al recommending the use of “shortcut icon” for favicon.ico (after stuffing our logs by quietly introducing this [mis]feature) and HTML link types being a space separated list (thanks @amoebe for pointing this out – I’d been looking at the Atom RFCs and assuming they used the single link type semantics), the term “shortcut” is effectively scorched earth. Not only is there a bunch of sites that already have “shortcut” links (even if the intention was that “shortcut icon” be atomic), but there’s a bunch of code that looks for “shortcut”, “icon” or “shortcut icon”. FWIW HTML 5 specifies the “icon” link type. Moral of the story: get consensus before implementing code.

As I still have problems with the URI/URL confusion (thus ruling out “shorturl”) but have come around to the idea that this should be a noun rather than an adjective, I now propose “shortlink” as a suitable, self-explanatory, impossible-to-confuse term.

Update: I’ve created a shortlink Google Group and kicked off a discussion with a view to reaching a consensus. I’ve also created a corresponding Google Code project and modified the shorter links WordPress plugin to implement shortlinks.

rev=”canonical” considered harmful (complete with sensible solution)

Sites like http://tinyurl.com/ provide a very simple service: turning unwieldy but information rich URLs like https://samj.net/2009/04/open-letter-to-community-regarding-open.html into something more manageable like http://tinyurl.com/ceze29. This was traditionally useful for emails with some clients mangling long URLs but it also makes sense for URLs in documents, on TV, radio, etc. (basically anywhere a human has to manually enter it). Shorteners are a dime a dozen now – there’s over 90 of them listed here alone… and I must confess to having created one at http://tvurl.com/ a few years back (the idea being that you could buy a TV friendly URL). Not a bad idea but there were other more important things to do at the time and I was never going to be able to buy my first island from the proceeds. Unfortunately though there are many problems with adding yet another layer of indirection and the repercussions could be quite serious (bearing in mind even the more trustworthy sites tend to come and go).

So a while back I whipped up a thing called “springboard” for Google Apps/AppEngine (having got bored with maintaining text files for use with Apache’s mod_rewrite) which allowed users to create redirect URLs like http://go.example.com/promo (and which was apparently a good idea because now Google have their own version called short links). This is the way forward – you can tell at a glance who’s behind the link from the domain and you even get an idea of what you’re clicking through to from the path (provided you’re not being told fibs). When you click on this link you get flicked over to the real (long) URL with a HTTP redirect, probably a 301 which means “Moved Permanently”, so the browsers know what’s going on too. If your domain goes down then chances are the target will be out of action too (much the same story as with third-party DNS) so there’s a lot less risk. It’s all good news and if you’re using a CMS like Drupal then it could be completely automated and transparent – you won’t even know it’s there and clients looking for a short URL won’t have to go ask a third party for one.
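
For the curious, the redirect half of this is tiny. The following is a minimal WSGI sketch of the idea, not the springboard code itself; the `REDIRECTS` table and its long target URL are hypothetical placeholders for whatever datastore or CMS lookup a real deployment would use.

```python
# Hypothetical slug table; a real deployment would load this from a datastore or CMS.
REDIRECTS = {
    "/promo": "http://example.com/2009/04/a-long-descriptive-url.html",
}


def application(environ, start_response):
    """Minimal WSGI app: answer known slugs with a 301, everything else with a 404."""
    target = REDIRECTS.get(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Unknown short link\n"]
    # 301 tells browsers and crawlers the move is permanent.
    start_response("301 Moved Permanently", [("Location", target)])
    return [b""]
```

To try it locally you could serve `application` with `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` and request `/promo`.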

So the problem is that nowadays you’ve got every man and his dog wanting to feed your nice clean (but long) URLs through the mincer in order to post them on Twitter. Aside from being a security nightmare (the resulting URLs are completely opaque, though now clients like Nambu are taking to resolving them back again!?!), it breaks all sorts of things from analytics to news sites like Digg. Furthermore there are much better ways to achieve this. If you have to do a round trip to shorten the URL anyway, why not ask the site for a shorter version of its canonical URL (that being the primary or ‘ideal’ URL for the content – usually quite long and optimised for SEO)? In the case of Drupal at least every node has an ID so you can immediately boil URLs down to http://example.com/node/123, http://example.com/123 or even use something like base32 to get even shorter URLs like http://example.com/3R.
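
As a sketch of the base32 trick, here is one way to turn a numeric node ID into a short path segment and back. The alphabet (digits followed by A to V) is an assumption on my part, since there are several base32 variants; with this particular one, node 123 happens to come out as 3R, matching the example URL.

```python
# Assumed alphabet: digits 0-9 followed by A-V (one of several base32 variants).
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def encode_id(node_id):
    """Encode a non-negative integer ID as a base32 string."""
    if node_id < 0:
        raise ValueError("node_id must be non-negative")
    if node_id == 0:
        return DIGITS[0]
    out = []
    while node_id:
        node_id, remainder = divmod(node_id, 32)
        out.append(DIGITS[remainder])
    return "".join(reversed(out))  # most significant digit first


def decode_id(text):
    """Decode a base32 string back to the integer ID (case-insensitive)."""
    return int(text, 32)
```

The saving grows with the ID: a seven-digit decimal node number fits in four base32 characters.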

So how do we express this for the clients? The simplest way is to embed LINK tags into the HEAD section of the HTML and specify a sensible relation (“rel”). Normally these are used to specify alternative versions of the content, icons, etc. but there’s nothing to say that for any given URL(s) the “short” url is e.g. http://example.com/3R. That’s right, rel=”short”, not rel=”alternate shorter” or other such rubbish (“alternate” refers to alternate content, usually in a different mime-type, not just an alternate URL – here the content is likely to be exactly the same). It can be performance optimised somewhat too by setting an e.g. X-Rel-Short header so that users (e.g. Twitter clients) can resolve a long URL to the preferred short URL via a HTTP HEAD request, without having to retrieve and parse the HTML.

Another even less sensible alternative being peddled by various individuals (and being discussed here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here, here and of course here) is [ab]using the rightly deprecated and confusing rev attribute à la rev=”canonical”. Basically this is saying “I am the authoritative/canonical URL and this other URL happens to point here too”, without saying anything whatsoever about the URL itself actually being short. There could be an infinite number of such inbound URLs and this only ever works for the one canonical URL itself. Essentially this idea is stillborn and I sincerely hope that when people come back to work next week it will be promptly put out of its misery.

So in summary someone’s got carried away and started writing code (RevCanonical) without first considering all the implications. Hopefully they will soon realise this isn’t such a great idea after all and instead get behind the proposal for rel=”short” at the WHATWG. Then we can all just add links like this to our pages:

<link href="http://example.com/promo" rel="short">

Incidentally I say “short” and not “shorter” because the short URL may not in fact be the shortest URL for a given resource – “http://example.com/3R” could well also map back to the same page but the URL is meaningless. And I leave out “alternate” because it’s not alternate content, rather just an alternate URL – a subtle but significant difference.

Let’s hope sanity prevails…

Update: The HTTP Link: header is a much more sensible solution to the HTTP header optimisation:

Link: <http://example.com/promo>; rel="short"