Friendly Binary-to-text Encodings

Summary: Choosing a text encoding for friendly, easy-to-say, difficult-to-misunderstand identifiers can be tricky, with important choices in density and characters.

Assumed audience: folks at least somewhat familiar with character sets and encoding data. If this doesn't describe you, don't lose heart! The world rewards curiosity.

Tags: math, computers

Created: 14 April 2024

I needed to create a bunch of identifiers that people were going to sometimes enter manually into a computer, or read to each other before entering them into a computer. For my situation, ideally, these identifiers would:

- handle ambiguous characters
  - people shouldn’t need to worry about the numeral 1 vs the letter i vs the letter l
- be allowable in URLs and in file paths in common file systems
  - don’t use / or \, and don’t use characters reserved in URLs like + or ?
- be quick and easy to say out loud
  - using mixed-case letters means people would have to say “uppercase j, 3, lowercase q, uppercase N”
  - fewer characters is generally better

Approach

This group of goals isn’t unusual. I knew I wasn’t the first to have them. Sometimes, I want to dig right in and solve the problem. Sometimes, I want to skim the literature and grab a solution. Other times, I want to go deeper and understand the goals and constraints that drove the existing solutions. This can be really rewarding. Comparing goals, constraints, and approaches can reveal parts of your problem you didn’t know you had, but it often rewards in “non-productive” ways, too. It’s fun, and you’ll never run out of things to talk about at parties.

(For instance, some of these systems were designed to never produce particular words, and others were designed not to make words at all. Upon reflection, I still don’t think that’s a goal here, but it’s nice to have considered it.)

Wikipedia has a decent collection of ways to encode binary data into text. Some of them overlap with our goals. For instance, some encodings explicitly cite ambiguous characters when explaining how the alphabet was chosen. The alphabet, or set of symbols used in an encoding system, is sometimes called the “dictionary” or the “character set”. Some try to handle character ambiguity by skipping characters that are easily confusable. Others specify a canonical encoding character for each group of confusable characters, and specify that all characters in the group decode to the same value.

There are many encoding systems. I collected my notes on some and have summarized them below.

Terminology

These ideas have been around for a long time and have spawned many variants. Terminology tends to be a little sloppy, unfortunately. For instance, the term “Base64” is often used to refer to any number of encodings that use 64 printable characters, and the phrase “base 64” is used to refer to a particular encoding called Base64.

RFC 4648 specifies how to encode input into Base32, specifying not just “base”, but also padding, alignment, the characters with their values, and other details. You could come up with an alternate alphabet, for instance, and while it would use base 32, it wouldn’t be “base32”. “Decimal” and “hexadecimal” don’t quite belong in the same category as “Base64” and “Crockford’s Base32”. (Note: RFC 4648 also defines “base16” and it’s what you’d expect—0–9, A–F.)

Decimal

My default option for encoding numbers for people is decimal. Decimal numerals (or decimals) are quick to say, easy to type, allowed in URLs and file paths, and already invented. They’re a good baseline.

Notably, using decimals doesn’t even require knowledge of the English alphabet.

Decimals aren’t dense compared to other choices. With only ten options per character, four digits in base 10 get us from 0 to 9,999, covering 10,000 values.

Examples of decimal identifiers include 42, 100, and 01189998819991197253.

Hexadecimal

Another option is hexadecimal, or base 16. Hexadecimal usually uses A–F as the digits beyond 0–9. These characters are unambiguous, easy to type, and are allowed in URLs and file paths.

Hexadecimal doesn’t require mixed-case characters, so no one needs to signify cases like “uppercase A, 7, lowercase b”.

With decimal, there are ways to read larger numbers, like “one thousand, three hundred and thirty-seven.” So far, none of the proposals for pronouncing hexadecimals have caught on (1, 2), so we’re limited to reading the characters one at a time.

(In the show “Silicon Valley”, Erlich Bachman says, “Ask me what nine times F is. It’s fleventy-five.” Nine times 0xF is 0x87. It doesn’t make obvious sense to me that you’d pronounce 0x87 as fleventy-five (unlike, say, maybe 0xF5). This might be a joke playing on Erlich’s abilities, or it may just be a mistake; either way, it’s funny how long I spent thinking about it.)

Sometimes, to distinguish hexadecimal from decimal or other bases, hexadecimal is written with a leading 0x, like 0x61 or 0x4D2. There are other programming notations, like an h suffix or a $ prefix, and in math, you may see a subscript after parentheses, like (1012)₁₆, but none of these would be necessary for these identifiers.
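As a small illustration of the 0x notation, here is a sketch in Python (my choice for the examples here, not something these identifiers depend on). The helper names `parse_hex` and `to_hex` are made up for this example; the built-in `int` and `format` do the real work.

```python
def parse_hex(text: str) -> int:
    """Parse hexadecimal text; int() accepts an optional 0x prefix when base=16."""
    return int(text, 16)


def to_hex(value: int, prefix: bool = False) -> str:
    """Format a non-negative integer as uppercase hexadecimal."""
    digits = format(value, "X")
    return f"0x{digits}" if prefix else digits


print(parse_hex("0x4D2"))   # 1234
print(parse_hex("4D2"))     # 1234
print(to_hex(97, prefix=True))  # 0x61
print(to_hex(9 * 0xF))      # 87
```

For identifiers, the unprefixed form is likely all you need.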

Many English words can be made with the traditional hexadecimal alphabet, especially when you allow for letter substitutions, like 0 for O. Lots of folks know about the BAD D00Ds and BABEs at the CAFE, drinking C0FFEE (even DECAF!). The café serves F00D, too, like BEEF, but few people know that the café has great F0CACC1A and FA1AFE1.

With 16 options per character, hexadecimal is denser than decimal. Four hexadecimal characters cover 65,536 values, from 0x0000 to 0xFFFF.
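The density numbers are just powers of the base. A minimal sketch, assuming fixed-length identifiers; `capacity` and `chars_needed` are hypothetical helper names for this example, and the loop uses integer math to avoid floating-point rounding from logarithms.

```python
def capacity(base: int, length: int) -> int:
    """Number of distinct values a fixed-length string can represent."""
    return base ** length


def chars_needed(base: int, values: int) -> int:
    """Smallest length whose capacity is at least `values`."""
    length = 1
    while base ** length < values:
        length += 1
    return length


print(capacity(10, 4))            # 10000
print(capacity(16, 4))            # 65536
print(chars_needed(10, 2 ** 32))  # 10
print(chars_needed(16, 2 ** 32))  # 8
```

So covering every 32-bit value takes ten decimal digits but only eight hexadecimal ones.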

Examples of hexadecimal identifiers include 35, 4D2, and BADDECAFC0FFEE.

Base64

Base64 is commonly used to transmit binary data via ASCII. To get 64 easily typed symbols, we have to use at least some letters in both upper- and lowercase, which makes it more awkward to read out than previous options.

There are many variations in what folks call “Base64”, but the “standard” “Base64” is defined in RFC 4648. The normal alphabet has + and / in it, which can be a pain in URLs and filenames.

Base64url, an alternate standard also proposed in RFC 4648, becomes file- and URL-safe by replacing + and / with - and _. It still has upper- and lowercase letters. Eliminating + and / makes it easier to use in URLs and file systems, but distinguishing - and _ may be problematic! (There are other standard alphabets, too.)

Base64 is quite dense. Four characters of Base64 cover 16,777,216 values, from AAAA to ////.

Examples of Base64 include NDI=, MTAw, and MTIzNA==.
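Python's standard `base64` module can reproduce these. In this sketch, the inputs are short ASCII strings that happen to encode to the examples above, and the two bytes in `raw` are picked purely so that the output contains the characters that differ between the two alphabets.

```python
import base64

# Short ASCII strings that encode to the examples above.
print(base64.b64encode(b"42"))    # b'NDI='
print(base64.b64encode(b"100"))   # b'MTAw'
print(base64.b64encode(b"1234"))  # b'MTIzNA=='

# Two bytes chosen so the standard output includes both + and /.
raw = b"\xfb\xff"
print(base64.b64encode(raw))          # b'+/8='
print(base64.urlsafe_b64encode(raw))  # b'-_8='
```

Note the trailing = padding, which is one more symbol to read out loud.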

Base32

Base32 seems to be closer to our needs. It uses 32 symbols. There are a few standard dictionaries, and nothing, really, stopping you from creating your own. Base32 variant dictionaries often contain only one letter case and omit /, and some skip easily confused pairs, like 1 and lowercase l. The “standard” Base32 definition is in RFC 4648. Its alphabet contains A–Z and 2–7, skipping 0, 1, 8, and 9.

The RFC also defines “base32hex” with an alphabet of 0–9, A–V. Base32hex has all the numerals in it, and, unlike Base32, when compared bit-wise, the encoded data sorts the same as the decoded data.
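The sorting property is easy to check. This sketch is a simplification: it encodes fixed-width integers with each alphabet rather than implementing the RFC's byte grouping and padding, and `encode_fixed` and `preserves_order` are hypothetical helpers for this example.

```python
BASE32 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567"     # RFC 4648 "base32" alphabet
BASE32HEX = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # RFC 4648 "base32hex" alphabet


def encode_fixed(value: int, alphabet: str, width: int) -> str:
    """Encode a non-negative integer as exactly `width` symbols."""
    base = len(alphabet)
    if not 0 <= value < base ** width:
        raise ValueError("value does not fit in the requested width")
    symbols = []
    for _ in range(width):
        value, digit = divmod(value, base)
        symbols.append(alphabet[digit])
    return "".join(reversed(symbols))


def preserves_order(alphabet: str, width: int = 2) -> bool:
    """True if the encoded strings, in numeric order, are already sorted."""
    count = len(alphabet) ** width
    encoded = [encode_fixed(n, alphabet, width) for n in range(count)]
    return encoded == sorted(encoded)


print(preserves_order(BASE32HEX))  # True
print(preserves_order(BASE32))     # False
```

The standard alphabet fails because the digits 2–7 carry the highest values but sort before the letters in ASCII.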

With 32 options per character, a four-character Base32 string covers 1,048,576 values.

Examples of Base32 identifiers include 2C45, 42, and WORD.

Crockford’s Base32

“Crockford’s Base32” excludes I, L, and O for ambiguity, and U for “accidental obscenity”. It also distinguishes decode symbols from encode symbols, allowing someone to enter a 1, i, I, L, or l, and they all mean the same thing—but it will re-encode that symbol to a 1. Crockford’s Base32 also allows for an optional checksum, which requires an extra five symbols, *, ~, $, =, and U.

Like the other Base32 variants here, with 32 options per character, a four-character Crockford’s Base32 string covers 1,048,576 values.

Examples of Crockford’s Base32 identifiers include 6GS0 and 64SK6DR.
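Here is a minimal sketch of the encode/decode asymmetry. It assumes each identifier is a single non-negative integer, which may not be how the examples above were produced, and it leaves out the optional checksum. The names `encode` and `decode` are mine.

```python
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"

# Every accepted input character maps to a value; only CROCKFORD is ever output.
DECODE = {symbol: value for value, symbol in enumerate(CROCKFORD)}
DECODE.update({symbol.lower(): value for symbol, value in list(DECODE.items())})
for alias, canonical in {"O": "0", "o": "0", "I": "1", "i": "1", "L": "1", "l": "1"}.items():
    DECODE[alias] = DECODE[canonical]


def encode(value: int) -> str:
    """Encode a non-negative integer using the canonical symbols."""
    if value < 0:
        raise ValueError("value must be non-negative")
    digits = []
    while True:
        value, digit = divmod(value, 32)
        digits.append(CROCKFORD[digit])
        if value == 0:
            break
    return "".join(reversed(digits))


def decode(text: str) -> int:
    """Decode text, accepting lowercase letters and the confusable aliases."""
    value = 0
    for symbol in text:
        if symbol not in DECODE:
            raise ValueError(f"invalid symbol: {symbol!r}")
        value = value * 32 + DECODE[symbol]
    return value


print(encode(1337))            # 19S
print(decode("l9s"))           # 1337
print(encode(decode("iL0o")))  # 1100
```

Whatever mix of 1, i, I, l, and L someone types, the value that comes back out is always written with a 1.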

Base32H

Base32H is a newer proposal, similar to Crockford’s Base32 in a lot of ways. Base32H was designed assuming it’s always encoded into numerals and uppercase letters. Due to the uppercase assumption, I and 1 are aliased together, but L is kept as its own symbol. It also aliases U with V, S with 5, and 0 with O. It doesn’t have a built-in mechanism for checksums. The canonical uppercase encoding may make it awkward in some places, like URLs.

Like the other Base32 variants here, with 32 options per character, a four-character Base32H string covers 1,048,576 values.

Examples of Base32H include 1A and 19R, but any string of 0–9 and A–Z (and a–z) can be decoded using Base32H.
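A decoding sketch, assuming the alphabet and aliases as I understand them from the Base32H description (O to 0, I to 1, S to 5, U to V, and lowercase folded to uppercase), and treating each identifier as a single integer. `decode_base32h` is a name made up for this example.

```python
BASE32H = "0123456789ABCDEFGHJKLMNPQRTVWXYZ"

# After upper-casing, fold the four excluded letters onto their canonical symbols.
ALIASES = str.maketrans("OISU", "015V")


def decode_base32h(text: str) -> int:
    """Decode any string of ASCII letters and digits as a Base32H integer."""
    canonical = text.upper().translate(ALIASES)
    value = 0
    for symbol in canonical:
        # str.index raises ValueError for anything that is not alphanumeric.
        value = value * 32 + BASE32H.index(symbol)
    return value


print(decode_base32h("1A"))   # 42
print(decode_base32h("19R"))  # 1337
print(decode_base32h("ia"))   # 42
print(decode_base32h("l"))    # 20
```

The last line shows the difference from Crockford’s Base32: a lowercase l is read as L, its own digit, rather than as 1.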

Base20 and Open Location Codes

Google created the “Open Location Code” (also known as Plus Code) geocode as an alternative to latitude and longitude coordinates. (I could get lost reading about geocodes; they’re so interesting! See also “discrete global grid”.)

Open Location Codes use a base 20 encoding. They use 2 through 9, C, F, G, H, J, M, P, Q, R, V, W, and X. Doug Rinckes on Google’s Travel team says:

[…] the character set was chosen to avoid spelling words in more than 30 different languages. We removed similar looking characters to reduce confusion and errors, and because they aren’t case-sensitive, they can be easily exchanged over the phone.

— Open Location Code: Addresses for everything, everywhere

Avoiding spelling words across many languages wasn’t one of my goals, but it was interesting enough to include in this round-up!

Open Location Codes are more complicated than just using that alphabet as a base 20 encoding, but an example Open Location Code is 86P8XPC6+39.

With 20 options per character, four digits in base 20 cover 160,000 values.

Examples of base 20 identifiers using this character set include 555, X2X2, and MP323.

“Word-safe” Base32

If you take the characters from Open Location Code from before, and use both uppercase and lowercase versions of the letters, you get 32 characters. This has been called a “word-safe” Base32.

This is mixed-case, of course, but the amount it tries to avoid words seems impressive.

Like the other Base32 variants here, with 32 options per character, a four-character “word-safe” Base32 string covers 1,048,576 values.

Examples of “word-safe” Base32 identifiers include 23 and j37xPX.
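The construction described above can be sketched in a few lines. The ordering of `WORD_SAFE` here (digits, then uppercase, then lowercase) is my assumption for illustration and may not match any particular implementation; `is_word_safe` is a hypothetical helper.

```python
OLC_ALPHABET = "23456789CFGHJMPQRVWX"  # the 20 Open Location Code characters

digits = "".join(c for c in OLC_ALPHABET if c.isdigit())
letters = "".join(c for c in OLC_ALPHABET if c.isalpha())

# Adding the lowercase letters gives 8 digits + 12 + 12 letters.
WORD_SAFE = digits + letters + letters.lower()


def is_word_safe(identifier: str) -> bool:
    """True if every character of a non-empty identifier is in WORD_SAFE."""
    return bool(identifier) and all(c in WORD_SAFE for c in identifier)


print(len(OLC_ALPHABET))       # 20
print(len(WORD_SAFE))          # 32
print(is_word_safe("j37xPX"))  # True
print(is_word_safe("hello"))   # False
```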

Potentially ambiguous character pairings

System	0/Oo	1/Ii	1/Ll	Ii/Ll	8/Bb	Uu/Vv	5/Ss
Decimal
Hexadecimal					❌
Base64	❌	❌	❌	❌	❌	❌	❌
base64url	❌	❌	❌	❌	❌	❌	❌
Open Location Code Base 20
RFC 4648 Base32						❌	❌
Crockford’s Base32					❌		❌
Base32H			❌		❌
“Word-safe” Base32

Each column represents a character pairing, like 0 and O/o, and each row represents a particular system. If a cell contains ❌, that row’s system distinguishes between characters in the first group from characters in the second group. A row with many ❌s is more likely to generate confusing outputs (depending upon context like how the output is presented, the font face, the size, and how the media wears over time).

Output Characters

System	Output characters
Decimal	`0` `1` `2` `3` `4` `5` `6` `7` `8` `9`
Hexadecimal	`0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `A` `B` `C` `D` `E` `F`
RFC 4648 Base64	`0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `A` `a` `B` `b` `C` `c` `D` `d` `E` `e` `F` `f` `G` `g` `H` `h` `I` `i` `J` `j` `K` `k` `L` `l` `M` `m` `N` `n` `O` `o` `P` `p` `Q` `q` `R` `r` `S` `s` `T` `t` `U` `u` `V` `v` `W` `w` `X` `x` `Y` `y` `Z` `z` `+` `/` (`=`)
Base64url	`0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `A` `a` `B` `b` `C` `c` `D` `d` `E` `e` `F` `f` `G` `g` `H` `h` `I` `i` `J` `j` `K` `k` `L` `l` `M` `m` `N` `n` `O` `o` `P` `p` `Q` `q` `R` `r` `S` `s` `T` `t` `U` `u` `V` `v` `W` `w` `X` `x` `Y` `y` `Z` `z` `-` `_` (`=`)
RFC 4648 Base32	`2` `3` `4` `5` `6` `7` `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M` `N` `O` `P` `Q` `R` `S` `T` `U` `V` `W` `X` `Y` `Z` (`=`)
Base32hex	`0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `A` `B` `C` `D` `E` `F` `G` `H` `I` `J` `K` `L` `M` `N` `O` `P` `Q` `R` `S` `T` `U` `V` (`=`)
Crockford’s Base32	`0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `A` `B` `C` `D` `E` `F` `G` `H` `J` `K` `M` `N` `P` `Q` `R` `S` `T` `V` `W` `X` `Y` `Z` (`*` `~` `$` `=` `U`)
Base32H	`0` `1` `2` `3` `4` `5` `6` `7` `8` `9` `A` `B` `C` `D` `E` `F` `G` `H` `J` `K` `L` `M` `N` `P` `Q` `R` `T` `V` `W` `X` `Y` `Z`
Open Location Code Base 20	`2` `3` `4` `5` `6` `7` `8` `9` `C` `F` `G` `H` `J` `M` `P` `Q` `R` `V` `W` `X`
“Word-safe” Base32	`2` `3` `4` `5` `6` `7` `8` `9` `C` `c` `F` `f` `G` `g` `H` `h` `J` `j` `M` `m` `P` `p` `Q` `q` `R` `r` `V` `v` `W` `w` `X` `x`

Conclusions

Remember:

- This wasn’t an exhaustive list.
- Your needs are unlikely to be exactly the same as mine.
- Make sure to review your output font while looking at the character tables.

Both Crockford’s Base32 and Base32H satisfy my original goals reasonably well, but I’m slightly partial to Crockford’s Base32.

Base32H has a lot of features that seem nice—like “every string of alphanumerics is a valid input”—but I’m not sure that I’ve needed them. In general, I like the alphabet used in Crockford’s Base32 a little better than Base32H’s. Compared to Crockford’s Base32, Base32H drops S (to reduce confusion with 5) and allows both L and 1, arguing that by telling folks to always output capital letters, there’s no ambiguity between L and 1.

Ultimately, breaking down this problem into goals and constraints and comparing existing solutions through that lens proved enlightening.
