You gotta know when to encode ’em, know when to... not encode ’em

A technical marketer needs to truly understand ASCII-to-Unicode encodings — at the very least including URL encoding (%NN, a.k.a. percent encoding) and HTML encoding (&charref;/&#NNNN;, a.k.a. ampersand encoding).

And ideally, Q encoding (=?...=NN?=, a.k.a. subject line encoding) too.

But you can't just encode everything “just in case.” Both failing to encode when you should and encoding when you shouldn't can cause emails, pages, and forms to fail catastrophically.

The harm done by erroneous encoding came up in a recent Marketo Nation thread. A user had copied some JS that originally had a line like this:

myButton.onclick = function(){
   document.location.href = "";

But for some reason, their CMS (not Marketo) decided to “fix it up” when it was pasted into their editor:

myButton.onclick = function(){
   document.location.href = ";that"; 

As you can see, the & has been replaced by its HTML character reference, &amp;. That particular example of HTML-encoding is extremely well known, only rivaled by &nbsp;.

But there's one problem.

You don't HTML-encode inside a <script> tag. Not “don't have to”, just “don't.” If you do, you're simply changing the value of the string: the URL now has the 5 literal characters &-a-m-p-; in it, exactly as if you had typed them into the location bar.

That's bad! Havoc ensues: the query string is broken, the wrong asset or follow-up page is loaded, attribution is wrong, and so on.
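
To make the damage concrete, here's a minimal sketch that parses the same query string with and without the stray encoding. The example.com URL and its this/that parameters are stand-ins made up for illustration (not the URL from the thread), and queryKeys is a hypothetical helper.

```javascript
// Hypothetical URL, for illustration only.
const intended = "https://www.example.com/page?this=1&that=2";

// The same string after a CMS HTML-encodes it inside a <script> block.
// JavaScript sees the five characters &amp; literally.
const mangled = "https://www.example.com/page?this=1&amp;that=2";

// Return the query parameter names, as the standard URL parser sees them.
function queryKeys(href) {
  return [...new URL(href).searchParams.keys()];
}

const intendedKeys = queryKeys(intended); // ["this", "that"]
const mangledKeys = queryKeys(mangled);   // ["this", "amp;that"]

console.log(intendedKeys, mangledKeys);
```

The parameter the destination page expects, that, never arrives. In its place is one named amp;that, which nothing downstream is looking for.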

I chose my words wisely above, though. I was about to write “You don't HTML-encode inside JavaScript.” But that isn't true — or is at best so hyper-precise that it isn’t useful.

Yes, it’s true that you don’t HTML-encode inside local <script> blocks, nor inside remote JS in a <script src>. But you might well need to encode in other contexts that produce JS code.

Where else, inside an HTML doc, do you find JS code? Good ol’ inline event handler attributes.

Remember inline event handlers?

We strive not to use them anymore (in favor of addEventListener) but Event Handler Content Attributes are still supported for legacy reasons.

These are the single-purpose events you add like:

<div onclick="SomeSpecialAction();">Do Something</div>
<a href="/somewhere" onclick="ActuallyGoSomewhereElse(); return false;">Go Somewhere</a>

Those onclick values eventually are run as JavaScript code, but first they go through the HTML attribute parser. Which means they support HTML-encoding, like any other attribute.

Let’s go to the spec

The HTML5 spec distinguishes what’s called the Script Data state from the (default) Data state and the Attribute Value state.

See, like any tokenizer, as a browser ingests your HTML, it switches between different states based on the sequence of characters it encounters.

In the Data state, when a & character is encountered, the state immediately switches to the Character Reference state, the state in which valid HTML-encoded sequences are transformed into the corresponding Unicode characters. The same goes for the Attribute Value state, which also switches to Character Reference upon seeing a &.

But in the Script Data state, i.e. reading the contents of a <script> block, no such switch occurs. The & is not special in Script Data.
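
Here's a toy model of that difference. It is emphatically not the real HTML tokenizer: decodeCharRefs and valueSeenByJs are hypothetical helpers, the named-reference table holds only a handful of entries, and the context names are just labels chosen for this sketch. All it shows is that the same source text yields different strings depending on which state the parser is in.

```javascript
// Tiny subset of named character references (the real table has thousands).
const NAMED = { amp: "&", lt: "<", gt: ">", quot: '"', nbsp: "\u00A0" };

// Decode &name;, &#NNNN; and &#xHHHH; sequences; leave anything else untouched.
function decodeCharRefs(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1].toLowerCase() === "x";
      const code = isHex ? parseInt(body.slice(2), 16) : parseInt(body.slice(1), 10);
      // Out-of-range code points are left as-is in this sketch.
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED, body) ? NAMED[body] : match;
  });
}

// context is one of "data", "attribute-value" or "script-data" (labels for this sketch).
function valueSeenByJs(source, context) {
  // In script data the & is not special, so the text passes through unchanged.
  return context === "script-data" ? source : decodeCharRefs(source);
}

const source = "?this=1&amp;that=2";
console.log(valueSeenByJs(source, "attribute-value")); // ?this=1&that=2
console.log(valueSeenByJs(source, "script-data"));     // ?this=1&amp;that=2
```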

And that’s why arbitrarily HTML-encoding because you think “This is all HTML, what can it hurt?” breaks stuff. Always HTML-encode when it’s necessary, and never HTML-encode when it’s not.

Back to the mistake at hand

The mistake that led to this post was HTML-encoding inside this <script>:

myButton.onclick = function(){
   document.location.href = ";that"; 

However, let’s imagine the redirect logic had been in an inline event handler instead:

<button onclick="document.location.href='';">Go Somewhere</button>

In this case, it would have been not only safe but correct to encode the value, since it’s an attribute:

<button onclick="document.location.href=';that';">Go Somewhere</button>
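
If you're generating such an attribute programmatically, a sketch like the following shows the direction of the transformation. The helper name and the example.com URL are assumptions for illustration, and the helper only covers double-quoted attribute values.

```javascript
// Hypothetical helper: HTML-encode a string for use inside a double-quoted attribute.
// The & must be replaced first, or the & in &quot; would be encoded a second time.
function escapeForDoubleQuotedAttribute(value) {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

// The JavaScript you actually want to run, with a raw & in the (made-up) URL.
const handlerJs = "document.location.href='https://www.example.com/page?this=1&that=2';";

// The markup to emit: the attribute parser turns &amp; back into & before the JS runs.
const buttonHtml =
  '<button onclick="' + escapeForDoubleQuotedAttribute(handlerJs) + '">Go Somewhere</button>';

console.log(buttonHtml);
```

The same handlerJs string pasted into a <script> block would go in untouched. The encoding step belongs to the attribute context, not to the JavaScript.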

Tricky, no?