Detecting the language used in a Marketo form fillout, Part I

When it comes to spam, every business is international, since botnets span the globe. And when bots speak an unfamiliar tongue (and you aren't using reCAPTCHA to stop bots in the first place), you can have a surprisingly tough time filtering out their malicious form fillouts, even though they seem glaringly out-of-place.

Since humans can identify individual bad form posts with excellent accuracy (in this case, fields that we literally don't know how to read), machines will be even better, right? Not so: weeding out “clearly bad” leads can be surprisingly difficult, because what's clearly wrong for your company is perfectly valid data as far as Marketo is concerned.

Take email addresses (and other form fields) with Chinese characters. If your company never does business in China or Taiwan, you know these leads aren't legit (and indeed, they were probably submitted via bots in hopes of spreading backlink spam). But Marketo can't know this: since forms can submit Unicode codepoints in the CJK ranges, it stands to reason that you may be able to read those fields.[1]

If you could locate these leads by the characters used, you could get rid of 'em. Marketo doesn't have built-in language or character detection features, but there are a couple of easy-to-integrate services that can help.

The almost right (and still fascinating) method: Google Translate API

First, we're going to look at an extremely accurate method of language detection for Marketo field values… then we're going to shoot it down because it's not the character/glyph detection that we need for anti-spam purposes.

(If it seems weird to give time to an interesting technology that won't work, that's because I originally thought it did work and would be the main thrust of this post!)

Marketo doesn't have a language detection feature, but luckily the Google Translate Language Detection API is really easy to call as a Marketo webhook.[2]

The webhook setup

I won't show the Google-side setup today (it's just generating an API key in their Control Panel) but let's look at how easy it is to set up the webhook side once you have the key.

You set up a webhook that POSTs JSON to https://translation.googleapis.com/language/translate/v2/detect. In the querystring you add ?key= and your key. The payload is a simple JSON object with one or more q (for “query”) properties:

[Screenshot: webhook payload with a single q property]

In this case I'm passing the {{lead.Last Name}} token. You can pass any {{Lead.}}, {{Company.}}, or {{My.}} token you want.
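If you want to sanity-check the request outside Marketo first, here's a minimal Python sketch of the same call. It only builds the URL and JSON body (no network call), and `build_detect_request` and `YOUR_API_KEY` are placeholder names of mine, not anything from Marketo or Google:

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the webhook setup described above.
DETECT_ENDPOINT = "https://translation.googleapis.com/language/translate/v2/detect"


def build_detect_request(api_key, text):
    """Return the (url, body) pair for a single-field detection call.

    Hypothetical helper for illustration: the key goes in the querystring
    and the text to inspect goes in a single "q" property.
    """
    url = DETECT_ENDPOINT + "?" + urlencode({"key": api_key})
    body = json.dumps({"q": text}, ensure_ascii=False)
    return url, body


# In Marketo the token is replaced with the lead's value before sending;
# here it is left as a literal string just to show the payload shape.
url, body = build_detect_request("YOUR_API_KEY", "{{lead.Last Name}}")
print(url)
print(body)
```

Send `body` with a `Content-Type: application/json` header from any HTTP client to reproduce what the webhook does.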

Here's an example response viewed in Postman:

[Screenshot: example detection response in Postman]
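For reference, here's a rough Python sketch of pulling the language code out of a response. The sample body is illustrative (it is not copied from the Postman capture) and assumes the nested `data.detections` shape that, as far as I know, the v2 detect endpoint returns; `detected_languages` is a name I made up:

```python
import json

# Illustrative response body, assumed shape: one inner list per "q" value,
# each holding candidate detections.
SAMPLE_RESPONSE = """
{
  "data": {
    "detections": [
      [
        {"confidence": 1, "isReliable": false, "language": "zh-TW"}
      ]
    ]
  }
}
"""


def detected_languages(response_text):
    """Return the first detected language code for each value that was sent."""
    payload = json.loads(response_text)
    return [
        candidates[0]["language"]
        for candidates in payload["data"]["detections"]
        if candidates  # skip any empty candidate list
    ]


print(detected_languages(SAMPLE_RESPONSE))
```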

To check multiple fields at once, you pass (and this is totally bizarre) multiple q properties. I cannot emphasize enough how head-spinningly strange this multiple query method is. Can you see the two things that are illogical about this method? Tell me in the comments!

Then you set a simple response mapping to a custom field if you want:

[Screenshot: webhook response mapping to a custom field]

The response mapping is actually optional. It allows you to permanently store the detected language on the lead (which you might use for other purposes, if you're not only deleting bot-generated leads). If you're just deleting, or moving leads into a list for a last check before deletion, you don't need to store the language on the lead: you can use the Webhook is Called trigger instead, as noted below.

Putting it together

Now that you've got the webhook set up with a valid API key, you'll need a typical sequence of batch+trigger campaigns to call the 'hook for your “interesting leads.” As always, be conservative with your webhook calls and don't waste resources when there won't be any interesting data. Constrain your campaigns by people who have Filled Out Form, for one thing, and also (since you're probably dealing with a storm of bots after a certain point) constrain by timestamp.

If you've used webhooks before, you already know how it goes. Ready a Trigger Smart Campaign (SC) with the Campaign is Requested trigger:

[Screenshot: Trigger SC with the Campaign is Requested trigger]

… whose Flow calls the webhook:

[Screenshot: Flow step calling the webhook]

… and have a Batch SC call the Trigger SC for the leads in question:

[Screenshot: Batch SC requesting the Trigger SC]

… and, finally, process results with Webhook is Called:

[Screenshot: trigger campaign using Webhook is Called]

In the last trigger, I'm matching the response string against

"language":"zh-

because I want to catch both "language": "zh-TW" (Chinese-Taiwan) and "language": "zh-CN" (Chinese-China).
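To see why the prefix does the job, here's a small Python sketch of the same "contains" test. The response strings are made-up samples, written compactly in the form the pattern expects, and `looks_chinese` is just an illustrative name:

```python
# The same substring used in the Webhook is Called constraint.
CHINESE_PREFIX = '"language":"zh-'


def looks_chinese(response_text):
    """True when the response contains a zh-* language code."""
    return CHINESE_PREFIX in response_text


# Made-up sample responses for illustration only.
samples = {
    "zh-TW": '{"data":{"detections":[[{"language":"zh-TW","confidence":1}]]}}',
    "zh-CN": '{"data":{"detections":[[{"language":"zh-CN","confidence":1}]]}}',
    "en": '{"data":{"detections":[[{"language":"en","confidence":1}]]}}',
}

for code, text in samples.items():
    print(code, looks_chinese(text))
```

Both regional variants match on the shared `zh-` prefix, while the English sample doesn't.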

And also note (discovered this only recently with a client) that even though the literal JSON response is

"language": "zh-TW"

Marketo removes the spaces (this is a bug IMO, as the response should be considered a literal string) so you have to match on:

"language":"zh-TW"

Big ole “BUT”

Here's the problem, as foreshadowed above: a Pinyin (transliterated/romanized) name like “Zhou Yougang” is (correctly) regarded as Chinese. From the Translate API's standpoint, it's indistinguishable from the same words written in Chinese characters.

And even in a monolingual English-speaking company, you'll have (or should hope to have!) a diverse customer base, so you shouldn't discard a lead merely because their name has roots in another language/region. The fact that they're using Pinyin (or Romaji for Japanese) may or may not signify that they're prepared to communicate in English. You can't know for sure: Japanese elementary schools teach Romaji, for example, so using it certainly isn't reserved for bilingual folks, and the lead may still be non-actionable.

So while I think the Google Translate API holds great promise for other purposes, it turns out to be not-quite-right for bot detection. In Part II, I'll show you how to use FlowBoost to accurately do the not-readable-here glyph detection we're looking for.
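To make the language-vs-glyph distinction concrete, here's a tiny Python sketch of character-based detection. To be clear, this is not the FlowBoost approach from Part II, just an illustration. It assumes that checking the main CJK Unified Ideographs block plus Extension A is enough for a demo, and the Chinese-character name is a hypothetical sample value:

```python
import re

# Assumption: only CJK Unified Ideographs (U+4E00-U+9FFF) and Extension A
# (U+3400-U+4DBF) are checked. A production check would likely cover more ranges.
CJK_PATTERN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")


def contains_cjk(value):
    """True when the value contains at least one character in the ranges above."""
    return CJK_PATTERN.search(value) is not None


print(contains_cjk("Zhou Yougang"))  # False: Pinyin uses only Latin letters
print(contains_cjk("王小明"))  # True: hypothetical name written in Chinese characters
```

A language detector groups both values together as Chinese. A glyph check separates them, which is the behavior we want for bot cleanup.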


Notes

[1] Form labels aren't posted to the server, so Marketo can't know whether the label was in the same language as the value or not. Otherwise that would be a pretty good clue to bot activity.

[2] While Google's API is dirt-cheap for this need, it isn't free. But at $20.00 for every 1,000,000 characters it's a laughably reasonable price point, especially for a one-time cleanup.
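As a back-of-the-envelope check on that price, here's a quick Python sketch. The lead count and average field length are made-up numbers for illustration; only the $20.00 per 1,000,000 characters rate comes from the note above:

```python
# Rate quoted in note [2].
PRICE_PER_MILLION_CHARS = 20.00


def estimated_cost(lead_count, avg_chars_per_lead):
    """Estimated detection cost in dollars for a batch of leads."""
    total_chars = lead_count * avg_chars_per_lead
    return total_chars * PRICE_PER_MILLION_CHARS / 1_000_000


# Hypothetical cleanup: 50,000 leads, about 10 characters checked per lead.
print(estimated_cost(50_000, 10))  # 10.0
```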