Big brands, bad code, Part 1 of ∞: A bad URL parser breaks Munchkin

In my Marketo consulting gigs, I've fixed Munchkin and forms issues for some huge brands. Hard to shock me at this point, but occasionally a high-traffic site is so broken that I'm in awe of what people/software can get away with.

Look, no grown-up should be surprised to see international B2C sites with hundreds of external assets, sloppy-but-working JS, and so on. Can't get worked up about that stuff because of the mind-boggling amount of content those folks have to deal with. That's not what I'm talking about: this is something deeply broken as opposed to merely suboptimal.

This one's a massive, Top-40-shouted-out liquor brand. A Community poster wondered why clicks on Marketo emails were resulting in 404 errors once the lead arrived on the corporate site. (The Click Email event was logged by the tracking server as expected, but the subsequent page view was a 404.)

A little poking showed that Munchkin tracking was broadly incompatible with the site's buggy HTTP framework, requiring their Marketo consultants to implement a clumsy and costly workaround (and that cost makes both parties on the Marketo side look bad, even though it's not their fault).

I feel like the bug would've been quickly squashed in the '90s to mid-2000s, when devs actually knew how HTTP worked. So it's likely modern software that's still maintained by someone. Unfortunately, the outside Marketo guys can't pressure the web team to look into the problem due to the usual shadow-IT/real-IT segregation. And, in fairness, I doubt the web team wrote or even has access to the broken code themselves: could be part of some commercial CMS that could take ages to fix. But none of those factors change the conclusion: Wow, this really shouldn't be in production.

Here's why deploying Munchkin has revealed how badly this framework is broken.

By default, when Marketo redirects a link via its tracking server (your branding domain), it also appends the query parameter mkt_tok to the final URL. This param is critical: it's used by Munchkin to relay the lead's identity (without revealing PII) and to log their subsequent web activities under the known lead. (I can talk about why it's necessary in another post. Suffice it to say you must have the mkt_tok passed to your site if you care about your leads' views and clicks.)

The value of mkt_tok is a non-reversible (we believe) hash of the lead ID + the campaign ID + the email ID + whatever else Marketo wants to track. (We aren't sure what's concatenated before hashing, but it clearly includes at least those data points.) A sample mkt_tok looks like this:

3ZkMMJWWfF9wsRohu6/JZKXonjHpfsX77O0pX6GxlMI/0FR3fOvrPUfGjI4ASMBkNq+TFAwABC5toziV8R7TELM141ccQXRbh

When you look at a range of these, it appears (though this is not documented) that they contain internal delimiters like / and +. In other words, those characters aren't generated by the hash/random algorithm but are added afterward to set off parts of the result. That's totally fine, no problem as far as I see it (if they were randomly generated, that would be fine, too). Long as Marketo knows the output format and what chars are reserved, they can losslessly compare against or look up the original input. I've done the same thing many times. That's why encoding exists: to let you pass opaque data via intermediate systems, without the systems accidentally thinking there's anything in there of interest.

And those characters are also totally meaningless in a query string, as long as they're URL-encoded. In other words, as long as aaa/bbb is encoded as aaa%2Fbbb there's no reason for a bug-free web server or framework to confuse that with the URL pathname aaa/bbb. And no reason for a bug-free server to confuse aaa%2Bbbb with the standard space-encoding aaa+bbb (which means aaa bbb when it's decoded).
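To make that concrete, here's a minimal sketch using Python's standard urllib.parse. The token value, the example.com host, and the /landing path are made up for illustration; this is not how Marketo builds its links, just what a spec-following parser does with an encoded / and + in a query value.

```python
from urllib.parse import quote, urlsplit, parse_qs

# Made-up stand-in for a mkt_tok-style value containing "/" and "+".
token = "aaa/bbb+ccc"

# safe="" forces "/" to be percent-encoded as well
# (quote() leaves "/" untouched by default).
encoded = quote(token, safe="")

# Hypothetical landing page URL, for illustration only.
url = "http://example.com/landing?mkt_tok=" + encoded

parts = urlsplit(url)
params = parse_qs(parts.query)

print(encoded)               # aaa%2Fbbb%2Bccc
print(parts.path)            # /landing
print(params["mkt_tok"][0])  # aaa/bbb+ccc

# By contrast, a literal "+" in a query string decodes to a space
# under form-encoding rules.
plus_as_space = parse_qs("q=aaa+bbb")["q"][0]
print(repr(plus_as_space))   # 'aaa bbb'
```

The pathname never sees the encoded slash, and the encoded plus round-trips as a plus, not a space.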

But "no reason" doesn't mean people won't screw up. And that's what the authors of this framework did. So badly that by simply adding a query param that includes %2B you can make a page magically 404. Yep, the homepage http://cantfeelmyface.com works fine and so does http://cantfeelmyface.com/?something=random (which also brings you to the homepage, since that query param has no meaning at the site). http://cantfeelmyface.com/?something+random is fine. But http://cantfeelmyface.com/?something=random%2Byeah? 404. SMDH.
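For comparison, here's roughly what a well-behaved parser makes of those four URLs. This is a sketch using Python's urllib.parse, not the site's actual framework (which I can't see); describe() is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qsl

urls = [
    "http://cantfeelmyface.com",
    "http://cantfeelmyface.com/?something=random",
    "http://cantfeelmyface.com/?something+random",
    "http://cantfeelmyface.com/?something=random%2Byeah",
]

def describe(url):
    """Return (pathname, decoded query params) for a URL."""
    parts = urlsplit(url)
    path = parts.path or "/"  # an empty path means the site root
    # keep_blank_values=True keeps a bare key like "something+random".
    params = parse_qsl(parts.query, keep_blank_values=True)
    return path, params

results = [describe(u) for u in urls]
for url, (path, params) in zip(urls, results):
    print(url, "->", path, params)
```

All four resolve to the pathname /, so routing has no business treating the last one differently. The only thing that changes is the decoded query value, which in the fourth case is random+yeah.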

Maybe some part of the framework tries to decode the %2B back to a literal +, then gets caught in a decoding loop and finally barfs out a 404?? I don't know, and don't care so much how they screwed it up. But just look at the consequences. All you needed to do was test against the HTTP spec and this never would have happened. Now, you've got consultants trying to explain things they shouldn't need to even think about. You've got a budget line for a workaround* to grudgingly approve.
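Purely as an illustration of that guess (I have no visibility into the real code, so treat this as an assumption, not a diagnosis), here's how decoding a value more than once corrupts it. It's a common class of bug; whether it's this bug is unknown.

```python
from urllib.parse import unquote_plus

raw = "random%2Byeah"        # the query value as it appears on the wire

once = unquote_plus(raw)     # correct: one decoding pass
twice = unquote_plus(once)   # bug: the already-decoded "+" is decoded again

print(repr(once))   # 'random+yeah'
print(repr(twice))  # 'random yeah'
```

One pass gives the value the sender intended. A second pass silently turns the plus into a space, and anything downstream that compares or re-parses the value is now working with garbage.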

I guess I don't have much of a moral here (maybe I'll edit the post later) except that money could have fed the hungry... or done a lot of other things. Even if you get paid for your time, taking the blame for broken stuff because it's too hard to explain or for political/hierarchical reasons just sucks.

* To be discussed later, maybe.