Unicode and UTF-8 (and emojis)

Doing some research for my “Bush hid the facts” article led me to read about Unicode, as well as the various ways to encode Unicode code points. I didn’t really look into the Unicode standard (it looked too complicated), but I still wanted to go back and have a look at the nitty-gritty details at some point.

I only knew that UTF-8 was a 21-bit encoding, and that it could encode any code point in 1 to 4 bytes.

21-bit encoding: this means that UTF-8 can encode 21 bits worth of information by using 1 to 4 bytes. However, this does not mean that Unicode numbers go up to 2,097,151! Unicode stops at U+10FFFF (1,114,111), so UTF-8 has room for numbers outside of the Unicode character range.
code point: this is what a “character” is called in Unicode. Each code point corresponds to a number. Most Western characters and simple emojis are a single code point: for example, “🤓” (called NERD FACE) has code point number 129299, or U+1F913. However, some characters in other scripts and complex emojis are made up of multiple code points (we’ll see that later).
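Here is a small Python sketch of those two definitions. The variable names are only for illustration; `ord` and `chr` are Python built-ins that convert between a character and its code point number.

```python
# Code point <-> character, using Python's built-ins.
nerd = "\U0001F913"            # NERD FACE
nerd_number = ord(nerd)        # 129299, i.e. 0x1F913

# 21 bits can hold values up to 0x1FFFFF...
max_21_bits = (1 << 21) - 1    # 2,097,151
# ...but Unicode stops at U+10FFFF.
max_code_point = 0x10FFFF      # 1,114,111

# Python refuses to build a character beyond the Unicode range.
try:
    chr(max_code_point + 1)
    beyond_range_allowed = True
except ValueError:
    beyond_range_allowed = False

print(nerd_number, hex(nerd_number), max_21_bits, max_code_point, beyond_range_allowed)
```

So the 21 bits are a ceiling on what the byte layout can carry, not the size of Unicode itself.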

There are three main advantages to UTF-8 over UTF-16:

No code point other than U+0000 itself will ever produce a NUL byte: this preserves compatibility with C-style strings
Fully backwards-compatible with 7-bit ASCII: a plain old ASCII text is also valid UTF-8, representing the same text
For Western languages, UTF-8-encoded text uses roughly half as much memory as UTF-16
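These three points can be checked quickly in Python. This is only a sketch on one sample string (the sample text and variable names are made up for the occasion), not a proof for every input.

```python
text = "Hello, world"                      # plain 7-bit ASCII sample

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")     # "-le" avoids the 2-byte BOM prefix

same_as_ascii = ascii_bytes == utf8_bytes  # backwards compatibility with ASCII
has_nul_utf8 = 0 in utf8_bytes             # False: no NUL byte in the UTF-8 form
has_nul_utf16 = 0 in utf16_bytes           # True: every ASCII letter gets a 0x00 byte

print(same_as_ascii, has_nul_utf8, has_nul_utf16)
print(len(utf8_bytes), "bytes in UTF-8 vs", len(utf16_bytes), "bytes in UTF-16")
```

For this ASCII-only sample, UTF-16 is exactly twice the size; text with accented letters would land somewhere below that ratio.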

Then, I stumbled upon “UTF-8: Bits, Bytes, and Benefits” when it was shared on lobste.rs. It presented in three simple rules how UTF-8 encodes numbers. The information jumped out at me, seemingly clear as day. Oops, did I just learn how to encode and decode UTF-8 by accident?

The algorithm is simple:

We’re working on the binary representation of the Unicode code point
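Each byte after the first holds 6 bits worth of information, preceded by 2 “continuation” bits (10) at the start
The first byte starts with as many 1s as there are bytes in total, followed by a 0, followed by any leftover data we might need to pack (the exception is plain ASCII, which fits in a single byte starting with 0)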
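The rules above can be sketched as a small hand-rolled encoder. The function name `utf8_encode` is made up for this example, and the sketch skips some checks a real encoder performs (for instance, it does not reject the surrogate range U+D800 to U+DFFF). In practice, `str.encode("utf-8")` is the thing to use.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the three rules above."""
    if code_point < 0:
        raise ValueError("code points are non-negative")
    if code_point < 0x80:
        # 7 bits or fewer: a single byte starting with 0 (plain ASCII).
        return bytes([code_point])
    if code_point < 0x800:
        total = 2      # 5 + 6 = 11 bits of room
    elif code_point < 0x10000:
        total = 3      # 4 + 6 + 6 = 16 bits of room
    elif code_point <= 0x10FFFF:
        total = 4      # 3 + 6 + 6 + 6 = 21 bits of room
    else:
        raise ValueError("outside the Unicode range")

    out = []
    # Continuation bytes, built from the low end: "10" + 6 bits of data.
    for _ in range(total - 1):
        out.append(0b10000000 | (code_point & 0b111111))
        code_point >>= 6
    # First byte: `total` ones, then a zero, then the leftover high bits.
    prefix = (0xFF << (8 - total)) & 0xFF
    out.append(prefix | code_point)
    return bytes(reversed(out))


print(utf8_encode(0x1F60A).hex(" "))
```

Comparing the result with Python's own encoder for a few code points is an easy sanity check: `utf8_encode(0x1F60A) == "\U0001F60A".encode("utf-8")`.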

I fired up Python, and tried some magic: my goal was to display the “😊” emoji. Emojipedia tells me this is Unicode code point U+1F60A.

>>> bin(0x1F60A)
'0b11111011000001010'
#  < c >< b  >< a  >  
>>> a = 0b10001010 # "10" + data
>>> b = 0b10011000 # "10" + data
>>> c = 0b10011111 # "10" + data
>>> d = 0b11110000 # "this is going to be 4 bytes" + no data
>>> bytearray([d, c, b, a]).decode("utf8")
'😊'

The process works! Armed with this new knowledge, I wanted to dig deeper into how various characters were encoded, so I built another small JavaScript web app, this time to decompose UTF-8-encoded text into its respective code points, detailing for each one how the algorithm works.

It’s just called “UTF-8 playground”, and is available at /utf8

It’s a playground, let’s play! In the next few sections, I’ll illustrate how emoji construction works by using this tool to split up popular emojis into their various code points.

A note about emojis

“Emojis” as they are understood in the West really came to popularity after they were added to Unicode in 2010 (Wikipedia). For example, Android unveiled emojis with the KitKat (4.4) release, in 2013.

All of the emojis added in Unicode 6.0 were single-code-point emojis, meaning that the Unicode organization assigned a different, unique number to each one. For example, the four heart colors (💛💚💙💜) each have their own code point.

We will see that there are other ways to make emojis!

Single code point emoji

Let’s analyze the cool emoji: 😎. Here’s what the playground has to say about it:

Character	Bytes	Meaning	Tally
😎 U+1F60E	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x98	10011000 (continuation + add 011000)	000011111011000
	0x8E	10001110 (continuation + add 001110)	000011111011000001110 => 0x1F60E

It is indeed made up of a single character (code point), with number 0x1F60E. That’s a big number, so UTF-8 uses 4 bytes to encode it, but it’s still a single Unicode character.
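For readers who want to reproduce a table like this one without the playground, here is a rough Python sketch of the same decomposition. It is not the playground's actual code (that one is JavaScript); the `decompose` function and its output format are assumptions made for this example, and it does not check for overlong or otherwise invalid encodings.

```python
def decompose(data: bytes):
    """Yield (code_point, [(byte, added_bits), ...]) for each encoded code point."""
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            total, head = 1, format(first, "07b")           # plain ASCII
        elif first >> 5 == 0b110:
            total, head = 2, format(first & 0b11111, "05b")
        elif first >> 4 == 0b1110:
            total, head = 3, format(first & 0b1111, "04b")
        elif first >> 3 == 0b11110:
            total, head = 4, format(first & 0b111, "03b")
        else:
            raise ValueError(f"unexpected first byte 0x{first:02X}")

        steps = [(first, head)]
        for byte in data[i + 1 : i + total]:
            if byte >> 6 != 0b10:
                raise ValueError(f"expected a continuation byte, got 0x{byte:02X}")
            steps.append((byte, format(byte & 0b111111, "06b")))
        if len(steps) != total:
            raise ValueError("truncated sequence")

        # The tally is all the data bits glued together, read as one number.
        tally = "".join(added for _, added in steps)
        yield int(tally, 2), steps
        i += total


for code_point, steps in decompose("\U0001F60E".encode("utf-8")):
    print(f"U+{code_point:04X}")
    for byte, added in steps:
        print(f"  0x{byte:02X}  {byte:08b}  add {added}")
```

Each printed row corresponds to one row of the table: the byte, its binary form, and the data bits it contributes to the tally.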

Combining emojis

Next up, let’s have a look at a country flag emoji: 🇫🇷. (note: that’s supposed to be the French flag. If you’re seeing something else, your OS spoiled the answer for you)

Character	Bytes	Meaning	Tally
🇫 U+1F1EB	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x87	10000111 (continuation + add 000111)	000011111000111
	0xAB	10101011 (continuation + add 101011)	000011111000111101011 => 0x1F1EB
🇷 U+1F1F7	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x87	10000111 (continuation + add 000111)	000011111000111
	0xB7	10110111 (continuation + add 110111)	000011111000111110111 => 0x1F1F7

Interesting! Instead of assigning a code point to each country flag (WorldPopulationReview lists 197 sovereign nations, plus 15 of disputed sovereignty), the Unicode Consortium decided to assign a “regional indicator” code point to each of the 26 letters of the Latin alphabet, and simply recommends combining them using the ISO 3166-1 alpha-2 standard to encode the flags. Combining literally means “putting them next to one another”. Hit “backspace” next to a flag, and some software will just pop up the remaining letter.
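Since the 26 regional indicator letters sit in alphabetical order starting at U+1F1E6, building a flag from a country code is a matter of simple arithmetic. A minimal sketch, where the `flag` helper is a made-up name and no check is done that the two letters are an actual ISO 3166-1 code:

```python
REGIONAL_INDICATOR_A = 0x1F1E6   # REGIONAL INDICATOR SYMBOL LETTER A


def flag(country_code: str) -> str:
    """Turn a two-letter code such as "FR" into its pair of regional indicators."""
    if len(country_code) != 2 or not (country_code.isascii() and country_code.isalpha()):
        raise ValueError("expected two ASCII letters")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )


french_flag = flag("FR")
print(french_flag, [f"U+{ord(c):04X}" for c in french_flag])
```

Whether the result shows up as a flag or as two boxed letters is entirely up to the font and the platform.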

Here’s the full list of emojis in Unicode 15.1. Look up “RGI_Emoji_Flag_Sequence” to get to the flag list. You’ll see that most of them are made up of two code points.

A funny note is that country flag emojis do not work on Windows: I don’t believe there has ever been a public explanation, but rumor has it that it was easier to just not ship this feature than have to censor the Taiwanese flag for the Chinese market.

This juxtaposition is also how “skin tone” modifiers work: 🏻 🏼 🏽 🏾 🏿 just get plopped down after any 🤚 and boom: 🤚🏿
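In code, that “plopping down” is plain string concatenation. A tiny sketch, using the raised back of hand (U+1F91A) and the darkest of the five modifiers (U+1F3FB to U+1F3FF); the variable names are only illustrative:

```python
hand = "\U0001F91A"          # RAISED BACK OF HAND
dark = "\U0001F3FF"          # EMOJI MODIFIER FITZPATRICK TYPE-6

dark_hand = hand + dark      # no joiner needed, the modifier simply follows

print(dark_hand)
print([f"U+{ord(c):04X}" for c in dark_hand])
print(len(dark_hand.encode("utf-8")), "bytes")
```

Two code points of 4 bytes each, so 8 bytes for what looks like a single emoji.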

The final boss of country flag emojis: 🏴󠁧󠁢󠁳󠁣󠁴󠁿 (this seriously messes up my terminal)

Literally [FLAG]gbsct[CANCEL], where the letters are invisible “tag” characters: GB for the United Kingdom, SCT for Scotland.
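To peek inside without wrecking a terminal, the sequence can be written with escapes and translated back to something readable. This sketch assumes the emoji above is the Scotland flag tag sequence as listed in the emoji data; the `describe` helper is a made-up name. Tag characters mirror printable ASCII, shifted up by 0xE0000.

```python
# U+1F3F4 (black flag), tag letters g b s c t, then U+E007F (cancel tag).
scotland = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"


def describe(sequence: str) -> str:
    """Render tag characters as their ASCII twins so the sequence is readable."""
    parts = []
    for ch in sequence:
        cp = ord(ch)
        if cp == 0xE007F:
            parts.append("[CANCEL]")
        elif 0xE0020 <= cp <= 0xE007E:
            parts.append(chr(cp - 0xE0000))   # tag letter -> plain ASCII letter
        else:
            parts.append(f"[U+{cp:04X}]")
    return "".join(parts)


print(describe(scotland))
print(len(scotland), "code points,", len(scotland.encode("utf-8")), "bytes")
```

Every one of the seven code points needs 4 bytes in UTF-8, which is how a single flag ends up weighing 28 bytes.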

Special case of combining emojis: emoji promotions

Unicode already had “symbols” way before emojis came out: they were usually low-res and in black and white, but they were there! The Miscellaneous Symbols block is full of them: ☀, ⛏, and others.

There would have been two bad ways to handle this:

Create new code points for these, and have one for the black-and-white version, and one for the “emoji” version
Update the definition of the original symbols to require colour

The first one introduces confusion (which one are you supposed to use?), and the second one is not reassuring regarding the stability of the Unicode standard.

A third way was chosen: ☀️

Character	Bytes	Meaning	Tally
☀ U+2600	0xE2	11100010 (3 bytes announcer + add 0010)	0010
	0x98	10011000 (continuation + add 011000)	0010011000
	0x80	10000000 (continuation + add 000000)	0010011000000000 => 0x2600
VS16 U+FE0F	0xEF	11101111 (3 bytes announcer + add 1111)	1111
	0xB8	10111000 (continuation + add 111000)	1111111000
	0x8F	10001111 (continuation + add 001111)	1111111000001111 => 0xFE0F

The “Variation Selectors” are a block of 16 invisible code points (U+FE00 to U+FE0F), which apply to the character before them and indicate which variation to display. Variation Selector 16 (VS16, U+FE0F) means “show me the emoji variant”.

So, ☀ by itself is the old symbol, but append U+FE0F and it becomes ☀️ !!
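The same promotion, spelled out in Python with escapes so nothing depends on how an editor handles invisible characters. The variable names are illustrative; VS15 (U+FE0E) is the sibling selector that asks for the text-style variant instead. Note that `bytes.hex` with a separator needs Python 3.8 or later.

```python
sun = "\u2600"            # BLACK SUN WITH RAYS, the old symbol
vs15 = "\uFE0E"           # VARIATION SELECTOR-15: ask for text presentation
vs16 = "\uFE0F"           # VARIATION SELECTOR-16: ask for emoji presentation

text_sun = sun + vs15
emoji_sun = sun + vs16

print(sun.encode("utf-8").hex(" "))
print(emoji_sun.encode("utf-8").hex(" "))
print(len(sun), "code point vs", len(emoji_sun), "code points")
```

The selector is only a request: a renderer without a colour glyph for the symbol will still fall back to whatever it has.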

Zero Width Joiner combinations

The Zero Width Joiner is code point U+200D, and is used to link together two or more characters. It is used to encode complex scripts, such as Arabic. It is also a big part of how newer emojis are encoded!

Let’s take the example of 👨‍🦯 (“man with white cane”). We already have Man👨, and we also have White Cane🦯, so instead of allocating a brand new code point, why not combine them?

Character	Bytes	Meaning	Tally
👨 U+1F468	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x91	10010001 (continuation + add 010001)	000011111010001
	0xA8	10101000 (continuation + add 101000)	000011111010001101000 => 0x1F468
ZWJ U+200D	0xE2	11100010 (3 bytes announcer + add 0010)	0010
	0x80	10000000 (continuation + add 000000)	0010000000
	0x8D	10001101 (continuation + add 001101)	0010000000001101 => 0x200D
🦯 U+1F9AF	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0xA6	10100110 (continuation + add 100110)	000011111100110
	0xAF	10101111 (continuation + add 101111)	000011111100110101111 => 0x1F9AF

I think the way this works is that the ZWJ implies a “with”? E.g. a man with a cane, or a flag with a rainbow. Also, it allows people to write stuff like 👨💻 without it turning into 👨‍💻 automatically.
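A short sketch of both cases: the joined sequence from the table, and the difference between two emojis that merely sit next to each other and two that are explicitly joined. Names such as `technologist` are just labels chosen for this example.

```python
ZWJ = "\u200D"                 # ZERO WIDTH JOINER
man = "\U0001F468"             # MAN
cane = "\U0001F9AF"            # the white cane from the table above
laptop = "\U0001F4BB"          # PERSONAL COMPUTER

man_with_cane = ZWJ.join([man, cane])
side_by_side = man + laptop             # two separate emojis
technologist = ZWJ.join([man, laptop])  # one joined emoji, if the font supports it

print(man_with_cane, len(man_with_cane.encode("utf-8")), "bytes")
print(side_by_side, "is not", technologist)
print(man_with_cane.split(ZWJ))         # splitting on the ZWJ recovers the parts
```

The 11 bytes match the table: 4 for the man, 3 for the joiner, 4 for the cane.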

What is the biggest we can make an emoji (in bytes)?

Of course, I want to see where the limit is. Looking up this question on the Internet drove me to this blog post, which concludes that the biggest they could find was 👨🏻‍❤️‍💋‍👨🏻, at 35 bytes.

The blog post is from last year, and mentions Unicode 15. We are still at Unicode version 15, albeit 15.1.0. This small update does not appear to add any new huge combination of emojis…so are we done?

I don’t think so! It appears that there is at least another bigger emoji: a custom family!

ZWJs allow us to compose our ideal family: by asking for two adults, two children, and specifying each one’s skin tone…

Character	Bytes	Meaning	Tally
👨 U+1F468	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x91	10010001 (continuation + add 010001)	000011111010001
	0xA8	10101000 (continuation + add 101000)	000011111010001101000 => 0x1F468
🏿 U+1F3FF	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x8F	10001111 (continuation + add 001111)	000011111001111
	0xBF	10111111 (continuation + add 111111)	000011111001111111111 => 0x1F3FF
ZWJ U+200D	0xE2	11100010 (3 bytes announcer + add 0010)	0010
	0x80	10000000 (continuation + add 000000)	0010000000
	0x8D	10001101 (continuation + add 001101)	0010000000001101 => 0x200D
👨 U+1F468	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x91	10010001 (continuation + add 010001)	000011111010001
	0xA8	10101000 (continuation + add 101000)	000011111010001101000 => 0x1F468
🏿 U+1F3FF	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x8F	10001111 (continuation + add 001111)	000011111001111
	0xBF	10111111 (continuation + add 111111)	000011111001111111111 => 0x1F3FF
ZWJ U+200D	0xE2	11100010 (3 bytes announcer + add 0010)	0010
	0x80	10000000 (continuation + add 000000)	0010000000
	0x8D	10001101 (continuation + add 001101)	0010000000001101 => 0x200D
👦 U+1F466	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x91	10010001 (continuation + add 010001)	000011111010001
	0xA6	10100110 (continuation + add 100110)	000011111010001100110 => 0x1F466
🏿 U+1F3FF	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x8F	10001111 (continuation + add 001111)	000011111001111
	0xBF	10111111 (continuation + add 111111)	000011111001111111111 => 0x1F3FF
ZWJ U+200D	0xE2	11100010 (3 bytes announcer + add 0010)	0010
	0x80	10000000 (continuation + add 000000)	0010000000
	0x8D	10001101 (continuation + add 001101)	0010000000001101 => 0x200D
👧 U+1F467	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x91	10010001 (continuation + add 010001)	000011111010001
	0xA7	10100111 (continuation + add 100111)	000011111010001100111 => 0x1F467
🏿 U+1F3FF	0xF0	11110000 (4 bytes announcer + add 000)	000
	0x9F	10011111 (continuation + add 011111)	000011111
	0x8F	10001111 (continuation + add 001111)	000011111001111
	0xBF	10111111 (continuation + add 111111)	000011111001111111111 => 0x1F3FF

That’s a whopping 41 bytes for 👨🏿‍👨🏿‍👦🏿‍👧🏿!! We see both ways of associating code points together:

The skin tone for each family member is right after the “man”, “boy” or “girl” emoji
They are all joined together with ZWJs
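Both byte counts can be double-checked by rebuilding the sequences from escapes. This is a sketch under the assumption that the two emojis are composed exactly as described (light skin tone for the kiss, dark skin tone for the family); the constant and helper names are made up for the example.

```python
ZWJ, VS16 = "\u200D", "\uFE0F"
MAN, BOY, GIRL = "\U0001F468", "\U0001F466", "\U0001F467"
HEART, KISS_MARK = "\u2764", "\U0001F48B"
LIGHT, DARK = "\U0001F3FB", "\U0001F3FF"   # skin tone modifiers

# Kiss: man + tone, heart + VS16, kiss mark, man + tone, all joined by ZWJs.
kiss = ZWJ.join([MAN + LIGHT, HEART + VS16, KISS_MARK, MAN + LIGHT])
# Family: each member followed by a skin tone, all joined by ZWJs.
family = ZWJ.join([MAN + DARK, MAN + DARK, BOY + DARK, GIRL + DARK])


def utf8_size(text: str) -> int:
    """Number of bytes the text takes once encoded as UTF-8."""
    return len(text.encode("utf-8"))


print(len(kiss), "code points,", utf8_size(kiss), "bytes")
print(len(family), "code points,", utf8_size(family), "bytes")
```

The family is 8 four-byte code points plus 3 three-byte joiners, hence 32 + 9 = 41 bytes.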

Update: Wow, this emoji is very inconsistent! It works fine on Firefox on Windows 10, but renders as 4 different 👨🏿 on iOS for example. I suppose custom families aren’t completely standardized yet?

What’s next?

Well, writing all this made me want a way to easily combine code points together, in addition to one to split them up. So look forward to that, I guess.

SixFoisNeuf

Totally irregular blog on computers and security