Consider decoding URLs if you’re just storing them for attribution

While working on a new attribution library (more on that later!) it occurred to me that we waste a lot of space treating URL-like strings as if they’re too precious to decode.

I’m deliberately using “URL-like” and not “URL” because there’s kind of an epistemological question involved: What is a URL? Or perhaps, When is a URL?

URLs are ASCII-only

Remember (hoping you do!) that URLs can only contain visible ASCII characters — capital A-Z, lowercase a-z, numeric 0-9 and a handful of symbols. (And some of those characters need to be %-encoded, depending on how you use them.)

More important, all characters outside the ASCII range must be %-encoded.

So a common letter like accented é (%C3%A9), everyday symbols like © and ™ (%C2%A9 and %E2%84%A2 respectively), and every non-Latin character can’t appear in human-readable form.
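To check those sequences yourself, here’s a quick sketch using the built-in encodeURIComponent, which percent-encodes the UTF-8 bytes of each character. The variable names are just for illustration.

```javascript
// Percent-encode a few non-ASCII characters (UTF-8 bytes, each written as %XX).
const eAcute = encodeURIComponent("é");     // "%C3%A9"
const copyright = encodeURIComponent("©");  // "%C2%A9"
const trademark = encodeURIComponent("™");  // "%E2%84%A2"

console.log(eAcute, copyright, trademark);
```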

Percent-encoding is wildly wasteful (but usually unavoidable)

While it’s great that there’s an established way to include any character, percent-encoding takes up a crazy amount of space.

Take the trademark ™ above. In its shortest encoded form[1], ™ is packed into just 2 bytes, but leaving that aside, in the more common UTF-8 it takes 3 bytes. That is, it would only take 3 bytes to send that symbol over the internet... anyplace other than in a URL.

But in the percent-encoded form mandated by URLs, it takes 9 bytes! The percent-encoded sequence %E2%84%A2 is a simple ASCII string. Each character takes only one byte. But there are nine of ’em, creating 200% overhead on the wire.
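The arithmetic is easy to reproduce. This is a small sketch, assuming an environment with TextEncoder (any modern browser, or recent Node); the variable names are mine.

```javascript
const symbol = "™"; // U+2122

// UTF-8: the bytes that would normally go over the wire.
const utf8Bytes = new TextEncoder().encode(symbol).length;      // 3

// UTF-16: one 16-bit code unit, so 2 bytes.
const utf16Bytes = symbol.length * 2;                           // 2

// Percent-encoded: a plain ASCII string, one byte per character.
const encoded = encodeURIComponent(symbol);                     // "%E2%84%A2"
const encodedBytes = new TextEncoder().encode(encoded).length;  // 9

// Overhead relative to raw UTF-8.
const overheadPct = ((encodedBytes - utf8Bytes) / utf8Bytes) * 100; // 200

console.log({ utf8Bytes, utf16Bytes, encodedBytes, overheadPct });
```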

It’s not just about consuming bandwidth, either. When you store full URLs in databases, they’re taking up that much more space, permanently.

When is a URL no longer a URL?

The Big Question: must we always treat a value that happened to be sent by a browser at some point as if it still needs to be a valid URL? Even if we’re never putting it back on the wire? Can it not revert to a URL-like string?

I was thinking about this code snippet a lot of people use (me included) to add more context to a Marketo form post:

MktoForms2.whenReady(function(mktoForm){
  mktoForm.addHiddenFields({
    lastMarketoFormURL : document.location.href,
    lastMarketoFormReferrerURL : document.referrer
  });
});

This code straightforwardly adds the current page (the page with the form) and the previous page (referrer, as available) to the form payload. It keeps the original percent-encoded URLs.

So say someone browsed your upcoming events and clicked on an upcoming speech by Ai Weiwei. The URL they clicked looked like this:

<a href="https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA">艾未未 (Ai Weiwei)</a>

The browser sent that exact href to the server, but it displayed the friendlier Chinese characters in the Location bar.

What you’d see in Marketo, with the above Forms 2.0 code, is:

https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA

But is that most appropriate? Wouldn’t it be at least as informative (and much more informative for a reader of Chinese) to see this:

https://eventcatalog.example.com/?artist=艾未未

And if you’re doing a Contains match in a Smart List, wouldn’t it make more sense to paste the Chinese characters? (Note the percent-encoded form doesn’t match the graphical form, nor vice versa. They’re different strings.)
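To make the “different strings” point concrete, here’s a rough sketch. It uses plain String.prototype.includes as a stand-in for a Contains match; that’s my analogy, not a claim about how Smart Lists are implemented, and the variable names are made up.

```javascript
const storedEncoded = "https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA";
const storedDecoded = decodeURI(storedEncoded);
const searchTerm = "艾未未";

// The encoded value is pure ASCII, so the Chinese characters never appear in it.
const matchesEncoded = storedEncoded.includes(searchTerm); // false

// The decoded value contains the characters as typed.
const matchesDecoded = storedDecoded.includes(searchTerm); // true

console.log({ matchesEncoded, matchesDecoded });
```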

Obviously, I’m thinking Yes: unless you have a compelling reason to the contrary, if you’re only storing a thing that once was a URL, you should be decoding it first. It saves space, is better for performance, and is more readable. (Here, the encoded value is 68 bytes long, decoded only 50 bytes — a 26% savings.)
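If you want to verify the byte counts, this sketch measures both forms as UTF-8. It assumes TextEncoder is available; byteLength is a hypothetical helper name.

```javascript
// Hypothetical helper: size of a string in UTF-8 bytes.
const byteLength = (s) => new TextEncoder().encode(s).length;

const encodedUrl = "https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA";
const decodedUrl = decodeURI(encodedUrl);

const encodedSize = byteLength(encodedUrl); // 68
const decodedSize = byteLength(decodedUrl); // 50

// Share of the encoded size saved by decoding, rounded to a whole percent.
const savingsPct = Math.round((1 - decodedSize / encodedSize) * 100); // 26

console.log({ encodedSize, decodedSize, savingsPct });
```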

Decoding URLs in Forms 2.0 JS

JavaScript has a built-in function, decodeURI, that’s perfect for this:

MktoForms2.whenReady(function(mktoForm){
  mktoForm.addHiddenFields({
    lastMarketoFormURL : decodeURI(document.location.href),
    lastMarketoFormReferrerURL : decodeURI(document.referrer)
  });
});
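One caveat worth planning for: decodeURI throws a URIError on malformed input, such as a lone % or escaped bytes that aren’t valid UTF-8, and a referrer is not under your control. Here’s a defensive variant as a sketch. safeDecodeURI is a hypothetical helper name, and falling back to the original string is just one reasonable choice.

```javascript
// Hypothetical helper: decode when possible, otherwise keep the value as-is.
function safeDecodeURI(value) {
  try {
    return decodeURI(value);
  } catch (e) {
    // decodeURI throws URIError on malformed sequences,
    // e.g. a lone "%" or bytes that are not valid UTF-8 (like "%E9").
    return value;
  }
}

// Guarded so the snippet is harmless on pages without Forms 2.0 loaded.
if (typeof MktoForms2 !== "undefined") {
  MktoForms2.whenReady(function(mktoForm){
    mktoForm.addHiddenFields({
      lastMarketoFormURL : safeDecodeURI(document.location.href),
      lastMarketoFormReferrerURL : safeDecodeURI(document.referrer)
    });
  });
}
```

With this in place, a bad referrer is stored in its original encoded form instead of stopping the hidden fields from being added.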

Interestingly, in too many years to count, I’ve never had occasion to use decodeURI before![2] (It’s not the same as decodeURIComponent, which I use constantly.)
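The difference shows up as soon as the URL contains an encoded reserved character. In this sketch the event=Q%26A parameter is added purely for illustration.

```javascript
const sample = "https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA&event=Q%26A";

// decodeURI leaves encoded reserved characters (here %26, an ampersand) alone,
// so the query string keeps its structure.
const viaDecodeURI = decodeURI(sample);
// "https://eventcatalog.example.com/?artist=艾未未&event=Q%26A"

// decodeURIComponent decodes everything, so the value "Q&A" now looks like
// a parameter separator followed by a stray "A".
const viaDecodeURIComponent = decodeURIComponent(sample);
// "https://eventcatalog.example.com/?artist=艾未未&event=Q&A"

console.log(viaDecodeURI);
console.log(viaDecodeURIComponent);
```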

A decoded URL is still a valid IRI

Internationalized Resource Identifier (IRI) is a standardized format that essentially means “URL/URI but with international characters left intact instead of encoded.”

So these are both valid IRIs:

https://eventcatalog.example.com/?artist=艾未未

https://eventcatalog.example.com/?artist=艾未未&event=Q%26A

Note the second one still has a percent-encoded reserved character, but the international Chinese characters are left intact. You could also choose to encode the Chinese characters and still have a valid IRI. The key is it’s more permissive than URI/URL syntax.

There are a variety of reasons that IRIs can’t replace URLs in the world at large, but they are a thing. So we’re not inventing a new format.
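And if a stored value ever does need to go back on the wire, the decoding is reversible. As a sketch, assuming an environment with the standard URL class (browsers and recent Node), parsing the IRI re-encodes the international characters and leaves the existing %26 untouched.

```javascript
const iri = "https://eventcatalog.example.com/?artist=艾未未&event=Q%26A";

// The URL parser percent-encodes non-ASCII characters as UTF-8 bytes
// and keeps escape sequences that are already present.
const asUrl = new URL(iri).href;
// "https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA&event=Q%26A"

// Decoding again returns the stored form.
const backToIri = decodeURI(asUrl);

console.log(asUrl);
console.log(backToIri === iri); // true
```

I’m using new URL() rather than encodeURI for the return trip because encodeURI also escapes the % sign, which would turn the existing %26 into %2526.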

Notes

[1] That is, UTF-16.

[2] decodeURI leaves percent-encoded reserved ASCII characters (like %26 for &) encoded, which makes it the wrong choice when you’re trying to decode individual params and values. But it works here.
