While working on a new attribution library (more on that later!) it occurred to me that we waste a lot of space treating URL-like strings as if they’re too precious to decode.
I’m deliberately using “URL-like” and not “URL” because there’s a kind of epistemological question involved: What is a URL? Or perhaps When is a URL?
URLs are ASCII-only
Remember (hoping you do!) that URLs can only contain visible ASCII characters: capital A-Z, lowercase a-z, numeric 0-9, and a handful of symbols. (And some of those characters need to be %-encoded, depending on how you use them.)
More important, all characters outside the ASCII range must be %-encoded.
So a common letter like accented é (%C3%A9), everyday symbols like © and ™ (%C2%A9 and %E2%84%A2 respectively), and every non-Latin character can’t appear in human-readable form.
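If you want to check those sequences yourself, here's a minimal sketch using the built-in encodeURIComponent, which percent-encodes each non-ASCII character as its UTF-8 bytes. The variable names are just for illustration.

```javascript
// Percent-encode a few non-ASCII characters.
// encodeURIComponent emits one %XX triplet per UTF-8 byte.
const sampleChars = ["é", "©", "™"];
const encodedChars = sampleChars.map((ch) => encodeURIComponent(ch));

console.log(encodedChars.join(" ")); // %C3%A9 %C2%A9 %E2%84%A2
```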
Percent-encoding is wildly wasteful (but usually unavoidable)
While it’s great that there’s an established way to include any character, percent-encoding takes up a crazy amount of space.
Take the trademark ™ above. In its shortest encoded form[1], ™ is packed into just 2 bytes, but leaving that aside, in more common UTF-8, it takes 3 bytes. That is, it would only take 3 bytes to send that symbol over the internet... anyplace other than in a URL.
But in the percent-encoded form mandated by URLs, it takes 9 bytes! The percent-encoded sequence %E2%84%A2 is a simple ASCII string. Each character takes only one byte. But there are nine of ’em, creating 200% overhead on the wire.
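The arithmetic is easy to reproduce. This is a rough sketch that assumes an environment with the standard TextEncoder (modern browsers and Node), which always encodes to UTF-8. The variable names are mine.

```javascript
// Compare the size of ™ in UTF-16, UTF-8, and percent-encoded form.
const trademark = "™";
const utf8Encoder = new TextEncoder(); // TextEncoder always produces UTF-8

const utf16Bytes = trademark.length * 2;                  // one UTF-16 code unit = 2 bytes
const utf8Bytes = utf8Encoder.encode(trademark).length;   // raw UTF-8 bytes
const percentForm = encodeURIComponent(trademark);        // "%E2%84%A2"
const percentBytes = utf8Encoder.encode(percentForm).length;

// Overhead of the percent-encoded form relative to raw UTF-8
const overheadPercent = ((percentBytes - utf8Bytes) / utf8Bytes) * 100;

console.log(utf16Bytes, utf8Bytes, percentBytes, overheadPercent); // 2 3 9 200
```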
It’s not just about consuming bandwidth, either. When you store full URLs in databases, they’re taking up that much more space, permanently.
When is a URL no longer a URL?
The Big Question: must we always treat a value that happened to be sent by a browser at some point as if it still needs to be a valid URL? Even if we’re never putting it back on the wire? Can it not revert to a URL-like string?
I was thinking about this code snippet a lot of people use (me included) to add more context to a Marketo form post:
MktoForms2.whenReady(function(mktoForm){
  mktoForm.addHiddenFields({
    lastMarketoFormURL : document.location.href,
    lastMarketoFormReferrerURL : document.referrer
  });
});
This code straightforwardly adds the current page (the page with the form) and the previous page (referrer, as available) to the form payload. It keeps the original percent-encoded URLs.
So say someone browsed your upcoming events and clicked on an upcoming speech by Ai Weiwei. The URL they clicked looked like this:
<a href="https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA">艾未未 (Ai Weiwei)</a>
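The browser sent that exact href to the server, but it displayed the friendlier Chinese characters in the Location bar:
https://eventcatalog.example.com/?artist=艾未未
What you’d see in Marketo, with the above Forms 2.0 code, is:
https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA
But is that the most appropriate? Wouldn’t it be at least as informative (and much more informative for a reader of Chinese) to see this:
https://eventcatalog.example.com/?artist=艾未未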
And if you’re doing a Contains match in a Smart List, wouldn’t it make more sense to paste the Chinese characters? (Note the percent-encoded form doesn’t match the graphical form, nor vice versa. They’re different strings.)
Obviously, I’m thinking Yes: unless you have a compelling reason to the contrary, if you’re only storing a thing that once was a URL, you should be decoding it first. It saves space, is better for performance, and is more readable. (Here, the encoded value is 68 bytes long, decoded only 50 bytes — a 26% savings.)
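To make the matching and size claims concrete, here's a small sketch using the example URL from above. It measures sizes as UTF-8 bytes via TextEncoder, and uses String.prototype.includes as a stand-in for a Contains match. That stand-in is my assumption for illustration, not a description of how Smart Lists work internally.

```javascript
// The encoded and decoded forms are different strings with different sizes.
const encodedEventUrl =
  "https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA";
const decodedEventUrl = decodeURI(encodedEventUrl);

// Size in UTF-8 bytes
const byteLength = (str) => new TextEncoder().encode(str).length;
const encodedSize = byteLength(encodedEventUrl);
const decodedSize = byteLength(decodedEventUrl);
const savingsPercent = Math.round((1 - decodedSize / encodedSize) * 100);

console.log(encodedSize, decodedSize, savingsPercent); // 68 50 26

// A substring search for the Chinese characters only hits the decoded form
console.log(encodedEventUrl.includes("艾未未")); // false
console.log(decodedEventUrl.includes("艾未未")); // true
```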
Decoding URLs in Forms 2.0 JS
JavaScript has a built-in method decodeURI that’s perfect for this:
MktoForms2.whenReady(function(mktoForm){
  mktoForm.addHiddenFields({
    lastMarketoFormURL : decodeURI(document.location.href),
    lastMarketoFormReferrerURL : decodeURI(document.referrer)
  });
});
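One caveat worth guarding against: decodeURI throws a URIError when it meets a malformed sequence, such as a stray % that isn't followed by two hex digits. Here's a defensive sketch. Note that safeDecodeURI is a hypothetical helper name of mine, not part of the Forms 2.0 API, and falling back to the original string is just one reasonable choice.

```javascript
// Hypothetical helper: decode when possible, otherwise keep the value as-is.
function safeDecodeURI(value) {
  if (!value) return ""; // e.g. an empty document.referrer
  try {
    return decodeURI(value);
  } catch (err) {
    // Malformed percent-sequences throw URIError; fall back to the raw value
    if (err instanceof URIError) return value;
    throw err;
  }
}

// Only runs where the Forms 2.0 library and a DOM are present
if (typeof MktoForms2 !== "undefined" && typeof document !== "undefined") {
  MktoForms2.whenReady(function (mktoForm) {
    mktoForm.addHiddenFields({
      lastMarketoFormURL: safeDecodeURI(document.location.href),
      lastMarketoFormReferrerURL: safeDecodeURI(document.referrer)
    });
  });
}
```

With this in place, a URL ending in a bare % is stored unchanged instead of breaking the rest of your form code.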
Interestingly, in too many years to count, I’ve never had occasion to use decodeURI before![2] (It’s not the same as decodeURIComponent, which I use constantly.)
A decoded URL is still a valid IRI
Internationalized Resource Identifier (IRI) is a standardized format that essentially means “URL/URI but with international characters left intact instead of encoded.”
So these are both valid IRIs:
https://eventcatalog.example.com/?artist=艾未未
https://eventcatalog.example.com/?artist=艾未未&event=Q%26A
Note the second one still has a percent-encoded reserved character, but the international Chinese characters are left intact. You could also choose to encode the Chinese characters and still have a valid IRI. The key is it’s more permissive than URI/URL syntax.
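A quick sketch of why decodeURI, and not decodeURIComponent, is the right fit for whole URLs like the second example. The input is the fully percent-encoded form of that URL, and the variable names are illustrative.

```javascript
// decodeURI restores the Chinese characters but leaves reserved characters
// (here %26, an encoded "&") alone, so the query string keeps its structure.
const fullyEncodedUrl =
  "https://eventcatalog.example.com/?artist=%E8%89%BE%E6%9C%AA%E6%9C%AA&event=Q%26A";

const asIri = decodeURI(fullyEncodedUrl);
const overDecoded = decodeURIComponent(fullyEncodedUrl);

console.log(asIri);       // https://eventcatalog.example.com/?artist=艾未未&event=Q%26A
console.log(overDecoded); // https://eventcatalog.example.com/?artist=艾未未&event=Q&A

// The over-decoded string now parses differently: "event" is cut off at the "&"
const eventFromIri = new URL(asIri).searchParams.get("event");
const eventFromOverDecoded = new URL(overDecoded).searchParams.get("event");
console.log(eventFromIri, eventFromOverDecoded); // Q&A Q
```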
There are a variety of reasons that IRIs can’t replace URLs in the world at large, but they are a thing. So we’re not inventing a new format.
Notes
[1] That is, UTF-16.
[2] decodeURI leaves percent-encoded reserved ASCII characters (like %26 for &) encoded, which makes it the wrong choice when you’re trying to decode individual params and values. But it works here.