The Salesforce (and Marketo) “URL” type supports Unicode: make sure your data warehouse knows this!

It’s well-established that SFDC’s URL field type isn’t up to the task of storing real-world URLs, since it maxes out at 255 characters.[1]

What isn’t discussed much, if at all: though RFC 3986-style URIs are ASCII-only, SFDC’s version of a “URL” isn’t limited to ASCII.

Indeed, a URL field can contain an IRI — a.k.a. a Fetch-era internationalized URL — whose Unicode characters haven’t been percent-encoded. (And it can be smart to store IRIs: as I mentioned in the recent post on decoding URLs you’re only storing for attribution reporting, IRIs are highly space-efficient.)

It bears repeating that there’s a clear career advantage to knowing your way around text encoding. You don’t want to be the one — and I’ve been the one myself, to my lasting shame — saying “Oops, we’ll never know what those ???? or ���� or ÿÿÿÿ characters were in the original data.”

When you pull data into your on-premises db, you need to know what you’re gonna get.

Don’t make an Ass out of U(nicode) and Me

Earlier today I was auditing the column types in one of my clients’ SQL databases, which they use to store backed-up SFDC data. (Not Marketo data in this case.)

They were exceeding MSSQL’s max row length of 8060 bytes on occasion, so the goal was to find some columns that were mistakenly using NVARCHAR, when they could’ve used a narrower VARCHAR based on the data being stored.

NVARCHAR takes 2x the bytes to store ASCII data. So if you know — and you do have to be 100% sure! — that all values are ASCII-only, now and forever, you can switch ’em to VARCHAR.[2][3]

General examples are Phone fields and Email fields, where the business meaning of the fields allows you to validate that they are ASCII-only on the way in. An SFDC-specific example is an ID type field, which is always 18 alphanumeric ASCII characters. You’re wasting 18 bytes by using an NVARCHAR for those.

I found some low-hanging narrowable fruit, but also discovered one field that may need to go in the other direction:

The database schema was automatically created by the (commercial) software that backs up SFDC, so it’s not anyone’s fault internally. But if a salesperson or integrated app happened to put an IRI into the Website field, SFDC would let that happen, and then data would be lost on the way into the warehouse.

So for this field value in SFDC:

You’d get this value in the database:

Luckily, I couldn’t see a single Contact where this had happened yet. But if they start deliberately storing IRIs to fit more in the 255-character limit, the SQL column must be widened to NVARCHAR(255).

So keep this case in mind and/or share it with your DBA!

Notes

[1] Which also affects the corresponding Marketo type, unfortunately.
[2] This client is using SQL Server 2017. SQL Server 2019 introduced VARCHAR columns with UTF-8 support, but jury’s still out on whether UTF-8 VARCHARs are even a good idea.
[3] “Switch” meaning create a new column, copy the values, drop the original column, and change the column name. Simple, but not actually a single step.