13 April 2009

Introducing rel="shortlink" - a better alternative to URL shorteners

Yesterday I wrote rather critically about a surprisingly successful drive to implement a deprecated "rev" relationship. This developed virtually overnight in response to the growing "threat" (in terms of linkrot, security, etc.) of URL shorteners including tinyurl.com, bit.ly and their ilk.

The idea is simple: allow sites to specify short URLs in the document or feed itself, either automatically ([compressed] unique identifier, timestamp, "initials" of the title, etc.) or manually (using a human-friendly slug). That way, when people need to reference the URL in a space-constrained environment (e.g. microblogging services like Twitter), or anywhere it needs to be entered manually (e.g. printed or spoken), they can do so in a fashion that will continue to work for as long as the target does and that reveals information about the content (such as its owner and a concise handle).

Examples of such short URLs include:
  • http://example.com/123
  • http://example.com/3R (123 base32 encoded)
  • http://example.com/csrf (title "initials")
  • http://example.com/promo (manual slug)
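
As a rough illustration of the two automatic schemes, here is a Python sketch that derives a short path from a numeric identifier and from a title. The alphabet (digits followed by A to V) is an assumption, chosen only because it reproduces the 3R example; the function names and the sample title are hypothetical.

```python
# Sketch only: the alphabet and helper names are assumptions, not part of any proposal.
BASE32_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def base32_id(number: int) -> str:
    """Encode a non-negative integer using the 32-character alphabet above."""
    if number < 0:
        raise ValueError("identifier must be non-negative")
    if number == 0:
        return BASE32_DIGITS[0]
    digits = []
    while number:
        number, remainder = divmod(number, 32)
        digits.append(BASE32_DIGITS[remainder])
    return "".join(reversed(digits))


def title_initials(title: str) -> str:
    """Build a slug from the first character of each word in the title."""
    return "".join(word[0].lower() for word in title.split() if word[0].isalnum())


print(base32_id(123))  # 3R
print(title_initials("Cross Site Request Forgery"))  # csrf
```

Initials will collide sooner or later, so a real site would probably need to check for uniqueness before handing one out.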

The idea is sound but the proposed implementation is less so. There is (or at least was) provision for "rev"erse link relationships, but these have been deprecated in HTML 5. There is also a way of hinting at the canonical URI by specifying a rel="canonical" link. This makes a lot of sense because the same document can often be referred to by an infinite number of URIs (e.g. in search results, with sort orders, aliases, categories, etc.). Combine the two and you've got a way of saying "I am the canonical URI and this other URI happens to point at me too". However, this can only ever (safely) work for the canonical URL itself, and it doesn't make sense to list one arbitrary URL when there could be an infinite number.

Another suggestion was to use rel="alternate shorter" but the problem here is that the content should be identical (except for superficial formatting changes such as highlighting and sort order), while "alternate" means "an alternate version of the resource" itself - e.g. a PDF version. Clients that understand "alternate" versions should not list the short URL, as the content itself is (usually) the same.

Ben Ramsey got closest to the mark with A rev="canonical" rebuttal but missed the "alternate" problem above, nonetheless suggesting a new rel="shorter" relation. The problem there is that the "short" URI is not guaranteed to be "shortest" or indeed even "shorter" - it still makes sense, for example, to specify a "short" URI of http://example.com/promo to a user viewing http://example.com/123 because the longer "short" URI conveys information about the content in addition to its host.

Accordingly I have updated WHATWG RelExtensions and will shortly submit the following to the IANA IESG for addition to the Atom Link Relations registry:
Value:
shortlink (http://purl.org/net/shortlink)

Description:
A short URI that refers to the same document.

Expected Display Characteristics:
This relation may be used as a concise reference to the document. It will
typically be shorter than other URIs (including the canonical URI) and may
rely on a [compressed] unique identifier or a human readable slug. It is
useful for space constrained environments such as email and microblogs as
well as for URIs that need to be manually entered (e.g. printed, spoken).
The referenced document may differ superficially from the original (e.g.
sort order, highlighting).

Security Considerations:
Automated agents should take care when this relation crosses administrative domains (e.g., the URI has a different authority than the current document). Such agents should also avoid circular references by resolving only once.
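
To make the security note a little more concrete, here is a minimal Python sketch of a client that checks whether a shortlink crosses authorities and follows the relation only once. Comparing host and port is a simplification of "authority", and falling back to the document's own URL is just one possible policy that I am assuming here; both function names are hypothetical.

```python
from urllib.parse import urlsplit


def crosses_authority(document_url: str, short_url: str) -> bool:
    """Return True when the two URIs differ in host or explicit port (a simplification)."""
    doc = urlsplit(document_url)
    short = urlsplit(short_url)
    return (doc.hostname, doc.port) != (short.hostname, short.port)


def pick_short_uri(document_url: str, advertised_short_url=None) -> str:
    """Choose a URI to share, consulting the shortlink relation a single time."""
    if advertised_short_url is None:
        return document_url
    if crosses_authority(document_url, advertised_short_url):
        # Assumed policy: do not trust a shortlink on another domain by default.
        return document_url
    # The shortlink is used as given; its own shortlink (if any) is never followed.
    return advertised_short_url


print(pick_short_uri("http://example.com/123", "http://example.com/promo"))
print(pick_short_uri("http://example.com/123", "http://tinyurl.com/abc"))
```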
Note that in the interim "http://purl.org/net/shortlink" can be used. Bearing in mind that you should be liberal in what you accept, and conservative in what you send, servers should use the interim identifier for now and clients should accept both. Nobody should be accepting or sending rev="canonical" or rel="alternate shorter" given the problems detailed above.
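
For clients, accepting both identifiers could look something like the following Python sketch, which uses the standard library's html.parser to collect matching link elements. The class and function names and the sample page are hypothetical, and a real client would also want to handle feeds.

```python
from html.parser import HTMLParser

# Both the proposed name and the interim identifier are accepted.
ACCEPTED_RELS = {"shortlink", "http://purl.org/net/shortlink"}


class ShortlinkFinder(HTMLParser):
    """Collect href values of <link> elements carrying an accepted relation."""

    def __init__(self):
        super().__init__()
        self.shortlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel is a space-separated list of link types, compared case-insensitively.
        rels = {token.lower() for token in (attributes.get("rel") or "").split()}
        href = attributes.get("href")
        if href and rels & ACCEPTED_RELS:
            self.shortlinks.append(href)


def find_shortlinks(html: str) -> list:
    finder = ShortlinkFinder()
    finder.feed(html)
    return finder.shortlinks


page = """<html><head>
<link rel="canonical" href="http://example.com/2009/04/some-long-title">
<link rel="http://purl.org/net/shortlink" href="http://example.com/promo">
</head><body></body></html>"""
print(find_shortlinks(page))  # ['http://example.com/promo']
```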

Update: It seems there are still a few sensible people out there, like Robert Spychala with his Short URL Auto-Discovery document. Unfortunately he proposes a term with an underscore (short_url) when it should be a space, and it perpetuates the usual URI/URL confusion. Despite people like Bernhard Häussner claiming that "short_url is best, it's the only one that does not sound like shortened content", I don't get this reading from a "short" link... seems pretty obvious to me and you can always still use relations like "abstract" for that purpose. In any case it's a valid argument and one that's easily resolved by using the term "shortcut" instead (updated accordingly above). Clients could fairly safely use any link relation containing the string "short".

Update: You can follow the discussion on Twitter at #relshortcut, #relshort and #revcanonical.

Update: I forgot to mention again that the HTTP Link: header can be used to allow clients to find the shortlink without having to GET and parse the page (e.g. by doing a HEAD request):

Link: <http://example.com/promo>; rel="shortlink"
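
On the client side, picking the shortlink out of that header might look like the Python sketch below. It is deliberately simplified: the regular expressions assume that parameter values contain no commas or semicolons, so it is not a complete parser for the header syntax, and the function name is hypothetical. The header value would normally come from the response to a HEAD request; here it is passed in as a string.

```python
import re

# One link-value: <target> followed by zero or more ;param=value pairs.
LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]*)*)')
# The rel parameter, either quoted or as a bare token.
REL_PARAM = re.compile(r';\s*rel\s*=\s*(?:"([^"]*)"|([^\s";,]+))', re.IGNORECASE)


def shortlink_from_header(header_value: str):
    """Return the first target whose rel list includes shortlink, else None."""
    for match in LINK_VALUE.finditer(header_value):
        target, params = match.group(1), match.group(2)
        rel = REL_PARAM.search(params)
        if rel is None:
            continue
        # A rel value may hold several space-separated relation types.
        tokens = (rel.group(1) or rel.group(2) or "").lower().split()
        if "shortlink" in tokens:
            return target
    return None


header = '<http://example.com/next>; rel="next", <http://example.com/promo>; rel="shortlink"'
print(shortlink_from_header(header))  # http://example.com/promo
```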
 
Update: Both Andy Mabbett and Stan Vassilev also independently suggested rel=shortcut, which leads me to believe that we're on a winner. Stan adds that we've other things to consider in addition to the semantics and Google's Matt Cutts points out why taking rather than giving canonical-ness (as in RevCanonical) is a notoriously bad idea.

Update: Thanks to the combination of Microsoft et al recommending the use of "shortcut icon" for favicon.ico (after stuffing our logs by quietly introducing this [mis]feature) and HTML link types being a space separated list (thanks @amoebe for pointing this out - I'd been looking at the Atom RFCs and assuming they used the single link type semantics), the term "shortcut" is effectively scorched earth. Not only is there a bunch of sites that already have "shortcut" links (even if the intention was that "shortcut icon" be atomic), but there's a bunch of code that looks for "shortcut", "icon" or "shortcut icon". FWIW HTML 5 specifies the "icon" link type. Moral of the story: get consensus before implementing code.
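
To see why "shortcut" is unusable, consider how a client would tokenise the rel attribute. This small Python sketch (the helper name is hypothetical) shows a favicon link matching a naive "shortcut" lookup, and also matching the looser substring test on "short" that I suggested earlier.

```python
def rel_tokens(rel_value: str) -> set:
    """Split a rel attribute into its lowercase, space-separated link types."""
    return set(rel_value.lower().split())


favicon_rel = "shortcut icon"
print("shortcut" in rel_tokens(favicon_rel))   # True: the favicon link matches rel="shortcut"
print("shortlink" in rel_tokens(favicon_rel))  # False: no collision with "shortlink"
print("short" in favicon_rel)                  # True: substring matching is fooled as well
```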

As I still have problems with the URI/URL confusion (thus ruling out "shorturl") but have come around to the idea that this should be a noun rather than an adjective, I now propose "shortlink" as a suitable, self-explanatory, impossible-to-confuse term.

Update: I've created a shortlink Google Group and kicked off a discussion with a view to reaching a consensus. I've also created a corresponding Google Code project and modified the shorter links WordPress plugin to implement shortlinks.

4 comments:

  1. I suggest replacing:

    Security Considerations:
    Automated agents should take care when this relation crosses administrative domains (e.g., the URI has a different authority than the current document).

    with:

    Security Considerations:
    Automated agents should take care when this relation crosses administrative domains (e.g., the URI has a different authority than the current document) that a reciprocal rel="canonical" attribute is in place; otherwise the claim made in rel="shortcut" MAY be discarded.

    (This will require URL shortening services to include a link with a rel="canonical" attribute. Perhaps "MAY" would be better as "SHOULD")

  2. Ok so I'm thinking about this but I'm not sure - it should be very easy (think HTTP HEAD) to resolve a shortcut URL from another... what risk exactly is it you're trying to mitigate here?

  3. Allowing, say, TinyURL to say that my page, which it is making a shortcut, is canonical, and me to say that that's correct from my page. All well and good when the shortener is well known, but when it's not..? Some (including Matt Cutts; though his concern also applies to different sites on the same domain) have apparently expressed concerns that there could be bogus claims, perhaps as part of SEO-hijacking.

    Also allows me to say that a version on my second domain is canonical, to a page on my first - that's not currently accepted by Google et al.

  4. Right but with RelShortcut nobody's claiming anything about canonical-ness. Maybe I've missed some inherent evil here but the only risk that stands out is that the URL redirector loses its mind/domain or deliberately redirects my traffic...

    Sam
